HyFormer: Hybrid Grouping-Aggregation Transformer and Wide-Spanning CNN for Hyperspectral Image Super-Resolution

Ji, Yantao; Shi, Jingang; Zhang, Yaping; Yang, Haokun; Zong, Yuan; Xu, Ling

doi:10.3390/rs15174131

Open AccessArticle

HyFormer: Hybrid Grouping-Aggregation Transformer and Wide-Spanning CNN for Hyperspectral Image Super-Resolution

by

Yantao Ji

¹

,

Jingang Shi

^1,*,

Yaping Zhang

¹,

Haokun Yang

¹,

Yuan Zong

² and

Ling Xu

³

¹

School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China

²

Key Laboratory of Child Development and Learning Science, Southeast University, Nanjing 211189, China

³

School of Human Settlements and Civil Engineering, Xi’an Jiaotong University, Xi’an 710049, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(17), 4131; https://doi.org/10.3390/rs15174131

Submission received: 17 June 2023 / Revised: 12 August 2023 / Accepted: 17 August 2023 / Published: 23 August 2023

(This article belongs to the Section Remote Sensing Image Processing)

Download

Browse Figures

Versions Notes

Abstract

:

Hyperspectral image (HSI) super-resolution is a practical and challenging task as it requires the reconstruction of a large number of spectral bands. Achieving excellent reconstruction results can greatly benefit subsequent downstream tasks. The current mainstream hyperspectral super-resolution methods mainly utilize 3D convolutional neural networks (3D CNN) for design. However, the commonly used small kernel size in 3D CNN limits the model’s receptive field, preventing it from considering a wider range of contextual information. Though the receptive field could be expanded by enlarging the kernel size, it results in a dramatic increase in model parameters. Furthermore, the popular vision transformers designed for natural images are not suitable for processing HSI. This is because HSI exhibits sparsity in the spatial domain, which can lead to significant computational resource waste when using self-attention. In this paper, we design a hybrid architecture called HyFormer, which combines the strengths of CNN and transformer for hyperspectral super-resolution. The transformer branch enables intra-spectra interaction to capture fine-grained contextual details at each specific wavelength. Meanwhile, the CNN branch facilitates efficient inter-spectra feature extraction among different wavelengths while maintaining a large receptive field. Specifically, in the transformer branch, we propose a novel Grouping-Aggregation transformer (GAT), comprising grouping self-attention (GSA) and aggregation self-attention (ASA). The GSA is employed to extract diverse fine-grained features of targets, while the ASA facilitates interaction among heterogeneous textures allocated to different channels. In the CNN branch, we propose a Wide-Spanning Separable 3D Attention (WSSA) to enlarge the receptive field while keeping a low parameter number. Building upon WSSA, we construct a wide-spanning CNN module to efficiently extract inter-spectra features. Extensive experiments demonstrate the superior performance of our HyFormer.

Keywords:

hyperspectral image; super-resolution; transformer

1. Introduction

Hyperspectral images (HSIs) typically consist of tens or even hundreds of spectral bands, providing a richer source of spectral information. This abundance of information enables hyperspectral images to effectively distinguish different objects, materials, and land cover types. Thanks to these characteristics, hyperspectral images have been widely applied in numerous computer vision tasks, including medical image processing [1,2], object tracking [3], mineral exploration [4], hyperspectral anomaly detection [5], plant detection [6], etc. Hyperspectral imaging typically requires high spectral resolution, resulting in a large amount of data that needs to be collected. However, due to the hardware limitations of sensors and acquisition equipment, spatial resolution is often sacrificed to obtain higher spectral resolution. To better serve downstream visual tasks, enhancing the spatial resolution of hyperspectral images has become a crucial research topic.

Hyperspectral super-resolution is an effective technique for enhancing the spatial resolution of hyperspectral images, allowing for the reconstruction of low-resolution hyperspectral images into higher-resolution hyperspectral images. Hyperspectral super-resolution methods can be categorized into two types based on their use of auxiliary information: (1) fusion-based hyperspectral image super-resolution. (2) Single hyperspectral image super-resolution. The former requires additional auxiliary information, such as RGB, panoramic images (PAN) [7,8], multispectral images (MSI) [9,10], etc., to enhance spatial resolution, while the latter relies solely on a single low-resolution hyperspectral image to restore its corresponding high-resolution counterpart. While fusion-based hyperspectral image super-resolution can achieve good performance, it is limited by the requirement of acquiring auxiliary images in the same scene as the low-resolution image, which increases the complexity of the task. Therefore, research on single-image super-resolution is more aligned with the practical needs of real-world scenarios.

Over the past few decades, a multitude of remarkable single hyperspectral image super-resolution methods have been proposed. Akgun et al. [11] improved the spatial resolution of hyperspectral images by modeling the image acquisition process. Li et al. [12] considered the sparsity of spectral decomposition and the repetitiveness of spatial-spectral blocks while proposing a hyperspectral super-resolution architecture based on spectral mixture and spatial-spectral group sparsity. Wang et al. [13] proposed a tensor-based super-resolution method that models the intrinsic characteristics of hyperspectral images. However, these methods rely on handcrafted priors, which are difficult to optimize. Recently, with the rapid advancement of deep learning, deep neural networks have showcased remarkable non-linear fitting capabilities. Consequently, natural super-resolution methods [14,15,16,17,18,19,20,21] based on convolutional neural networks (CNNs) have garnered significant attention. Unlike natural images, hyperspectral images exhibit spectral continuity, where neighboring spectral bands are often correlated. Therefore, 3D convolution is frequently employed to extract features from hyperspectral images. Mei et al. [22] constructed a fully 3D convolutional neural network to perform hyperspectral super-resolution. Yang et al. [23] proposed a multi-scale wavelet 3D CNN by modulating wavelets with 3D CNN to improve the restoration of details. While 3D CNN has powerful representational capabilities, it is often accompanied by a substantial computational burden and a large number of parameters. Thus, Li et al. [24] proposed MCNet, which stacks hybrid modules modulating 2D and 3D convolutions to extract both spatial and spectral features. Furthermore, Li et al. proposed ERCSR to address the issue of parallel structure redundancy in MCNet [25]. Zhang et al. [26] proposed a multi-scale network that utilizes wavelet transform and multi-scale feature fusion to learn features across different spectral bands. In ref. [27], researchers designed a multi-domain feature learning strategy based on 2D/3D units to integrate information from different layers. Tang et al. [28] proposed FRLGN, which incorporates a feedback structure to propagate high-level information for guiding the generation of low-level features. Zhang et al. [29] explored the coupling relationship between the spectral and spatial domains, then utilized spectral high-frequency information to improve channel and spatial attention.

While CNN-based methods have achieved impressive results, they still face the following issue: the widespread use of convolution kernels with a size of 3 in these CNN-based methods severely limits the receptive field range, hindering the model from considering a wider range of contextual information. While it is theoretically expected that the receptive field increases as the network deepens, the effective receptive field is often much smaller than the anticipated result [30] in practice. Directly increasing the kernel size could improve the receptive field size. However, it also leads to a significant increase in parameters and computational complexity. This is especially impractical for 3D convolutions.

Recently, transformers [31] in the field of natural language processing have gained increasing popularity due to their powerful long-range modeling capabilities. Moreover, ViT [32] has successfully extended the application of transformers to the field of computer vision, leading to the emergence of numerous outstanding vision transformer models [33,34,35,36,37,38,39,40,41]. Vision transformers are primarily designed to capture long-range dependencies in the spatial domain. However, as shown in Figure 1, each spectral band of a hyperspectral image exhibits significant sparsity in the spatial domain, with many regions lacking meaningful information. Performing vanilla self-attention calculations on sparse spatial domains would result in significant wastage of computational resources. Consequently, vision transformers originally designed for natural images need to be further improved to adapt to hyperspectral images.

In this paper, we design a hybrid architecture named HyFormer to integrate the advantages of CNN and transformer for hyperspectral super-resolution tasks. CNN and transformer extract features from different perspectives. Specifically, the transformer branch could achieve intra-spectra interaction for fine-grained contextual details on each specific wavelength. The CNN branch helps to conduct efficient inter-spectra feature extraction among different wavelengths while maintaining a large receptive field. By effectively modulating the two types of features, their advantages can complement each other, thereby enhancing the modeling capability of the model.

In the transformer branch, we present a novel Grouping-Aggregation transformer (GAT) that aims to capture intra-spectra interactions of object-specific contextual information from a spectral perspective. By decomposing the spectra of each wavelength, diverse contextual details of objects are implicitly expressed in different channels. The proposed GAT considers each channel as an individual token and focuses on modeling long-range dependencies among different details within the spectra of the same wavelength. Moreover, self-attention can consider all positions in the spatial domain to further capture global dependencies. During the computation of self-attention, the GAT employs a novel grouping-aggregation self-attention, which consists of grouping self-attention and aggregation self-attention. Specifically, grouping self-attention is employed to extract features from rich details coming from different channels, while the grouping mechanism is utilized to maintain lower computational complexity. On the other hand, our aggregation self-attention aims to fuse features from different channels to facilitate the exchange of information across channels. The GAT modulates dual self-attention to adaptively model fine-grained contextual details.

In the CNN branch, we propose a Wide-Spanning Separable 3D Attention (WSSA) to enhance the receptive field of the model while keeping a low parameter number. This method stacks a set of specifically designed small-kernel 3D convolutions to simulate the same receptive field as a large-kernel 3D convolution. Specifically, this method consists of three steps to simulate a large-kernel 3D convolution. Firstly, we simulate a large-kernel 3D convolution using the concept of depthwise separable convolution to reduce the parameter number, which cascades a pointwise and a large-kernel depthwise 3D convolutions. Subsequently, the large-kernel depthwise 3D convolution is decomposed into a small-kernel 3D depthwise convolution and a small-kernel dilated depthwise 3D convolution. Finally, the aforementioned two types of small-kernel depthwise 3D convolutions are further separated in the spatial and spectral dimensions to reduce computational complexity. Unlike 2D pixel attention, which solely considers spatial dimensions, WSSA preserves the inherent spatial-spectral consistency information. Based on WSSA, we construct a wide-spanning CNN module to extract inter-spectra features among different wavelengths while achieving a large receptive field to consider a wider range of contextual information.

Our hybrid architecture enables adaptive feature interactions between the CNN and transformer modules at each layer to facilitate the fusion of various features.

In summary, the contributions of this paper can be summarized as follows:

We propose a novel Grouping-Aggregation transformer (GAT) to capture intra-spectra interactions of object-specific contextual information from a spectral perspective. By modulating grouping self-attention and aggregation self-attention, GAT can adaptively model fine-grained contextual details.
We introduce Wide-Spanning Separable 3D Attention (WSSA) to explore inter-spectra feature extraction. It significantly enhances the receptive field while only increasing minimal additional parameters.
We designed a hybrid architecture named HyFormer that modulates the strengths of both CNN and transformer structures. HyFormer could adaptively fuse features extracted by CNN and transformer components, resulting in improved reconstruction outcomes. Abundant experimental results unequivocally demonstrate the substantial superiority of our HyFormer over state-of-the-art methods.

2. Proposed Methods

In this section, we will provide a detailed description of our method. The goal of hyperspectral image super-resolution is to restore a low-resolution hyperspectral image to a high-resolution hyperspectral image. Let

I_{L R} \in R^{B \times S \times h \times w}

and

I_{H R} \in R^{B \times S \times H \times W}

denote the low-resolution and high-resolution hyperspectral images, respectively. The S denotes the number of spectral bands. We use

I_{S R} \in R^{B \times S \times H \times W}

to denote restored hyperspectral image. Therefore, the process of super-resolution can be represented as follows:

I_{S R} = Network (I_{L R}) .

(1)

2.1. Overall Network

The proposed HyFormer is illustrated in Figure 2. The top part represents the transformer branch, while the bottom part represents the CNN branch. There are interactions between the two branches at each layer to exchange information with different characteristics. At the ends of branches, the features extracted from both branches are adaptively fused to enhance the representation.

Initially, we employ separable 3D convolution [42]

f_{S F}

to extract shallow-level features from the input low-resolution (LR) image. Separable 3D convolutions have been demonstrated to exhibit similar performance as conventional 3D convolutions while offering reduced computational complexity [42]. This process can be represented as follows:

F_{S F} = f_{3 D} (Unsqueeze (I_{L R})),

(2)

where

F_{S F} \in R^{B \times C \times S \times h \times w}

denotes extracted shallow-level features, and C denotes channel number. The

Unsqueeze (\cdot)

denotes reshaping

I_{L R}

into the shape of (

B \times 1 \times S \times h \times w

). Next, the feature

F_{S F}

is simultaneously fed into the transformer and CNN branches to extract different types of features. For transformer branch, the feature

F_{S F} \in R^{B \times C \times S \times h \times w}

needs to be reshaped into

{\tilde{F}}_{S F} \in R^{(B \times S) \times C \times h \times w}

by

f_{reshape 4 D} (\cdot)

. The feature extraction process for both branches in the first layer can be represented as follows:

\begin{matrix} T_{1} = f_{1}^{T} (f_{reshape 4 D} (F_{S F})) \\ C_{1} = f_{1}^{C} (F_{S F}), \end{matrix}

(3)

where

f_{1}^{T} (\cdot)

and

f_{1}^{C} (\cdot)

respectively denote the first transformer and CNN modules. The

T_{1}

and

C_{1}

are extracted features. In the subsequent feature extraction layers, we perform adaptive fusion of the two types of features to complement each other. Mathematically, the subsequent feature extraction process can be represented as:

\{\begin{matrix} T_{i} = f_{i}^{T} ({Conv 2 D}_{1 \times 1} (Concat (w_{i - 1}^{t} T_{i - 1}, w_{i - 1}^{c} f_{reshape 4 D} (C_{i - 1})))) \\ C_{i} = f_{i}^{C} ({Conv 3 D}_{1 \times 1 \times 1} (Concat ({\hat{w}}_{i - 1}^{t} f_{reshape 5 D} (T_{i - 1}), {\hat{w}}_{i - 1}^{c} C_{i - 1}))) \end{matrix} (i = 2, \dots, L),

(4)

where

w_{i - 1}^{t}, w_{i - 1}^{c}

and

{\hat{w}}_{i - 1}^{t}, {\hat{w}}_{i - 1}^{c}

are learnable coefficients used to adaptively adjust the feature fusion ratio, with an initial value of 1. The fused features are then fed into convolution layers

{Conv 2 D}_{1 \times 1}

and

{Conv 3 D}_{1 \times 1 \times 1}

to reduce the channel number. The

f_{reshape 5 D} (\cdot)

denotes the reshape operation from (

(B \times S) \times C \times h \times w

) to (

B \times C \times S \times h \times w

).

To enable the network to learn more informative representations, we perform feature fusion at the end of both the transformer and CNN branches. During this fusion process, learnable coefficients are employed to control the fusion ratio. Mathematically, this process can be represented as follows:

\{\begin{matrix} T = {Conv 2 D}_{1 \times 1} (Concat (ε_{1}^{t} T_{1}, \dots, ε_{i}^{t} T_{i}) \\ C = {Conv 3 D}_{1 \times 1 \times 1} (Concat ({\hat{ε}}_{1}^{c} C_{1}, \dots, {\hat{ε}}_{i}^{c} C_{i}) \end{matrix} (i = 1, \dots, L),

(5)

where

ε_{i}^{t}

and

{\hat{ε}}_{i}^{c}

denote learnable coefficients, initialized to 1.

The transformer branch focuses on capturing intra-spectra interactions at each specific wavelength, whereas the CNN branch facilitates efficient inter-spectra feature extraction among different wavelengths. Consequently, we fuse these two features to enhance the representation. Lastly, we employ deconvolution layers to perform upsampling, thereby increasing the spatial resolution. Due to the significant similarity between the LR input image and the SR output image, we incorporate the bicubic interpolated LR image into the output to guide the model to focus on high-frequency residual features. The final reconstruction process can be represented as follows:

I_{S R} = Bicubic (I_{L R}) + f_{3 D} (Up (F_{S F} + {Conv 3 D}_{1 \times 1 \times 1} (Concat (γ^{t} f_{reshape 5 D} (T), γ^{c} C)))),

(6)

where

γ^{t}

and

γ^{c}

are learnable coefficients to adjust the feature fusion ratio, with an initial value of 1. The

Up (\cdot)

denotes the deconvolution to increase the spatial resolution. The

f_{3 D} (\cdot)

is separable 3D convolution used to decrease the channel number.

2.2. Grouping-Aggregation Transformer Module

The transformer models have shown remarkable capability in capturing long-range dependencies in the spatial domain. However, these transformer models designed for natural images are not suitable for hyperspectral images due to their specific characteristics. Hyperspectral images exhibit sparsity in the spatial domain. Therefore, dense self-attention in the vanilla transformer would lead to numerous inefficient computations.

Based on the analysis mentioned above, we propose a grouping-aggregation transformer (GAT) to capture intra-spectra interactions of object-specific contextual information from a spectral perspective. During the extraction of shallow features, 3D convolution is applied to decompose each specific wavelength of the spectral data, thereby implicitly expressing diverse texture information across different channels. Our GAT treats each channel as a token and performs self-attention in the channel dimension to capture intra-spectra interactions for fine-grained contextual details on each specific wavelength. Meanwhile, self-attention allows for the consideration of all positions within the spatial domain, enabling the model to capture global dependencies more effectively.

We improve the standard of self-attention by introducing grouping-aggregation self-attention (GASA). The GASA is composed of grouping self-attention (GSA) and aggregation self-attention (ASA). To be more specific, we assign half of the channels to grouping self-attention, which is used to extract rich details from different channels. Meanwhile, the grouping mechanism ensures that the computation of self-attention is only performed within each group, reducing the computational cost of self-attention. The remaining half of the channels are dedicated to aggregation self-attention, which utilizes aggregation to merge features from different channels, allowing interaction among features of different textures. Subsequently, the extracted two features are concatenated to align the channel dimensions while achieving modulation of features from both self-attentions to adaptively model long-range channel dependencies from a spectral perspective.

Figure 3 illustrates the workflow of grouping-aggregation self-attention. Let

X \in R^{B S \times h w \times C}

denote the input features of GAT. For the grouping of self-attention, we linearly map

X

to obtain

q u e r y

Q^{g}

,

k e y

K^{g}

, and

v a l u e

V^{g}

:

Q^{g} = X W_{g}^{Q}, K^{g} = X W_{g}^{K}, V^{g} = X W_{g}^{V},

(7)

where

W_{g}^{Q}, W_{g}^{K}, W_{g}^{V} \in R^{C \times \frac{1}{2} C}

are learnable parameters. Subsequently, we equally divide the

Q^{g}

,

K^{g}

and

Q^{g}

tensors by channels into k different groups, ensuring that the self-attention calculation is performed only within each group. The grouping self-attention can be formulated as:

A^{g} = {Attention}^{g} (Q^{g}, K^{g}, V^{g}) = {\{V_{i}^{g} SoftMax ({(K_{i}^{g})}^{T} Q_{i}^{g} / \sqrt{h w})\}}_{i = 0}^{k} .

(8)

For aggregation self-attention, we map

X

to get

q u e r y

Q^{a}

,

k e y

K^{a}

, and

v a l u e

V^{a}

:

Q^{a} = X W_{a}^{Q}, K^{a} = Aggregation (X) W_{a}^{K}, V^{a} = Aggregation (X) W_{a}^{V},

(9)

where

Aggregation (\cdot)

is

1 \times 1

convolution to aggregate diverse features from different channels, while decreasing channel number from C to

ξ C

, in which

ξ \in (0, 1)

is hyperparameter to control aggregation ratio. The

W_{a}^{Q} \in R^{C \times \frac{1}{2} C}

,

W_{a}^{K}, W_{a}^{V} \in R^{ξ C \times \frac{1}{2} ξ C}

are learnable parameters. The aggregation self-attention can be formulated as:

A^{a} = {Attention}^{a} (Q^{a}, K^{a}, V^{a}) = V^{a} SoftMax ({(K^{a})}^{T} Q^{a} / \sqrt{h w}) .

(10)

Next, we concatenate

A^{g}

and

A^{a}

to incorporate features extracted by dual self-attention:

A = Concat (A^{g}, A^{a}),

(11)

where

A

is the feature representation extracted by grouping-aggregation self-attention. As shown in Figure 4, after grouping-aggregation self-attention, we utilize a multi-layer perceptron (MLP) with the non-linear activation function GELU [43] to enhance the representations. Before the grouping-aggregation self-attention and MLP, we use the LayerNorm layer [44] to do normalization. The feature extraction process in the transformer module can be represented as follows:

\begin{matrix} X_{T}^{'} = X + GASA (LN (X)) \\ X_{T} = X_{T}^{'} + MLP (LN (X_{T}^{'})), \end{matrix}

(12)

where

LN (\cdot)

denotes the LayerNorm and

MLP (\cdot)

denotes multi-layer perceptron. The

GASA (\cdot)

denotes the grouping-aggregation self-attention.

2.3. Wide-Spanning CNN Module

The 3D CNN has shown promising performance in hyperspectral image super-resolution, as it can extract features while preserving spatial-spectral consistency. It is indeed a common practice in previous works to utilize a

3 \times 3 \times 3

convolution kernel in 3D convolution, which inherently restricts the receptive field of the model. This limitation becomes particularly problematic for hyperspectral images, as the sparse nature of the spatial domain necessitates a larger receptive field. A small receptive field hampers the model’s ability to comprehend context, leading to potential limitations in performance. Directly increasing the size of the convolution kernel can expand the receptive field, but it comes with a significant increase in parameters and computational complexity, making it an impractical approach. In this paper, we propose a novel approach called wide-spanning separable 3D attention to address the aforementioned issue.

The wide-spanning, separable 3D attention is shown in the bottom right corner of Figure 5. We stack a set of specifically designed small-kernel 3D convolutions to simulate the same receptive field as a large-kernel 3D convolution.

In this paper, we simulate the large-kernel 3D convolution through three steps. First, depthwise separable convolution can significantly reduce the parameters and computational complexity. We simulate a

17 \times 17 \times 17

3D convolution using the concept of depthwise separable convolution to reduce the parameter number, which cascades

1 \times 1 \times 1

pointwise and

17 \times 17 \times 17

depthwise 3D convolution. Second, we further decompose

17 \times 17 \times 17

depthwise 3D convolution into a

5 \times 5 \times 5

depthwise 3D convolution and a

5 \times 5 \times 5

dilated depthwise 3D convolution with a dilation factor of

(3, 3, 3)

, which achieves the same receptive field as a

17 \times 17 \times 17

depthwise 3D convolution. As shown in Figure 6, we demonstrate this process of simulating the large receptive field.

However, due to the inherent nature of 3D convolutions, they still have a relatively large number of parameters. Separable 3D convolution has been proven to have similar effects to vanilla 3D convolution [42]. Therefore, we adopt a similar concept to separable 3D convolution to further decompose the

5 \times 5 \times 5

depthwise 3D convolution and the

5 \times 5 \times 5

dilated depthwise 3D convolution mentioned earlier. Specifically, we use a

1 \times 5 \times 5

depthwise 3D convolution and a

5 \times 1 \times 1

depthwise 3D convolution to simulate a

5 \times 5 \times 5

depthwise 3D convolution. Meanwhile, we use a

1 \times 5 \times 5

dilated depthwise 3D convolution with a dilation factor of

(1, 3, 3)

and a

5 \times 1 \times 1

dilated depthwise 3D convolution with a dilation factor of

(3, 1, 1)

to simulate a

5 \times 5 \times 5

dilated depthwise 3D convolution. As a result, the parameter number is significantly reduced, allowing for more feasible usage.

To effectively extract inter-spectra features among different wavelengths while achieving a large receptive field to consider a wider range of contextual information, we designed a wide-spanning CNN module that is applied to the CNN branch. The structure of our CNN module is illustrated in Figure 5. We employ both 3D convolution and 2D convolution to extract features and apply the proposed wide-spanning separable 3D attention at the end of the module to expend the receptive field. Skip connections are employed between each component to enhance the flow of information.

As mentioned earlier, separable 3D convolution provides similar performance to regular 3D convolution. Thus, in our CNN module, we employ separable 3D convolution to extract local spatial and spectral features simultaneously. It preserves spatial-spectral consistency, which is beneficial for restoring physically meaningful spectral curves. Furthermore, to enhance the extraction of spatial information, we incorporate 2D CNN units following the 3D convolution to explore spatial features. Thanks to the inclusion of our wide-spanning separable 3D attention, the model has a wider receptive field, allowing it to extract spatial and spectral features within a larger spatial-spectral context. The feature extraction in the CNN module can be represented as follows:

\begin{matrix} X_{3 D} = f_{3 D} (X) \\ X_{2 D} = X_{3 D} + f_{2 D U i n t} (f_{2 D U i n t} (X_{3 D})) \\ X_{C} = X_{2 D} + X_{2 D} ⊙ f_{W S A} (X_{2 D}), \end{matrix}

(13)

where

f_{3 D} (\cdot)

denotes separable 3D convolution and

f_{2 D U i n t} (\cdot)

denotes 2D convolution unit. The

f_{W S A} (\cdot)

is the proposed wide-spanning separable 3D attention.

3. Experiments

3.1. Datasets

In this section, we provide a comprehensive overview of the experimental results obtained using the proposed HyFormer on three publicly available hyperspectral datasets. We also present a detailed comparison with state-of-the-art (SOTA) methods to evaluate the performance of our approach. Additionally, we conduct ablation experiments for each module to validate their effectiveness and understand their individual contributions to the overall performance.

Our experiments are conducted on the following three hyperspectral datasets:

(1) CAVE dataset [45]: This dataset contains 31 scenes, which were gathered by a cooled CCD camera. The spectral wavelengths of the image range from 400 nm to 700 nm, with a step size of 10nm between bands. Therefore, each image has 31 spectral bands. Their spatial resolution is

512 \times 512

pixels.

(2) Harvard dataset [46]: This dataset consists of 77 hyperspectral images divided into indoor and outdoor scenes. They were captured by a Nuance FX, CRI Inc. camera with a wavelength range of 400 to 700 nm. Each hyperspectral image in the dataset has a size of

1040 \times 1392 \times 31

.

(3) Foster dataset [47]: This dataset consists of 30 images captured by a low-noise Peltier-cooled digital camera in the Minho region of Portugal. Each hyperspectral image contains 33 spectral bands, and they have a spatial resolution of

1204 \times 1344

pixels.

To ensure the fairness and reliability of our experimental evaluation, we train and test the models on each dataset separately. Because different datasets are collected using different hyperspectral cameras, which may have varying characteristics and imaging conditions. In our experiments, we adopt a random partitioning strategy where we allocate 80% of the samples from each dataset as the training set. The remaining 20% of samples are reserved exclusively for testing purposes. Due to the limited number of samples in the hyperspectral dataset, following [24,25], we employ data augmentation techniques during the training phase to augment the training samples. We randomly select 24 patches within each image and augment them through horizontal flipping, rotation (

90^{\circ}

,

180^{\circ}

, and

270^{\circ}

), and scaling (1, 0.75, and 0.5) operations. We downsample each patch using bicubic interpolation with a scale factor r (r = 2, r = 4, and r = 8) to obtain a low-resolution hyperspectral image. During the testing phase, we assess the model’s performance by evaluating its results on the

512 \times 512

region located in the top left corner of each test image.

3.2. Evaluation Metrics

To quantitatively measure the performance of the model, we utilize metrics such as Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), and Spectral Angle Mapper (SAM) to evaluate results. They are defined as:

\begin{matrix} PSNR = \frac{1}{S} \sum_{s = 1}^{S} 10 {log}_{10} (\frac{{MAX}_{s}^{2}}{{MSE}_{s}}) \\ {MSE}_{s} = \frac{1}{H W} \sum_{h = 1}^{H} \sum_{w = 1}^{W} {(I_{S R} (h, w, s) - I_{H R} (h, w, s))}^{2}, \end{matrix}

(14)

where the

{MAX}_{s}

is maximal pixel value for sth band.

\begin{matrix} SSIM = \frac{1}{S} \sum_{s = 1}^{S} \frac{(2 μ_{I_{S R}}^{s} μ_{I_{H R}}^{s} + c_{1}) (2 σ_{I_{S R} I_{H R}}^{s} + c_{2})}{m \times n} \\ m = {(μ_{I_{S R}}^{s})}^{2} + {(μ_{I_{H R}}^{s})}^{2} + c_{1} \\ n = {(σ_{I_{S R}}^{s})}^{2} + {(σ_{I_{H R}}^{s})}^{2} + c_{2}, \end{matrix}

(15)

where

μ_{I_{S R}}^{s}

and

μ_{I_{H R}}^{s}

denote the mean of

I_{S R}

and

I_{H R}

for the sth band, respectively. The

σ_{I_{S R}}^{s}

and

σ_{I_{H R}}^{s}

denote the variance of

I_{S R}

and

I_{H R}

for the sth band. The

σ_{I_{S R} I_{H R}}^{s}

denotes the covariance of

I_{S R}

and

I_{H R}

for the sth band. The

c_{1}

and

c_{2}

are two constants.

SAM = \arccos (\frac{〈I_{S R}, I_{H R}〉}{{∥I_{S R}∥}_{2} {∥I_{H R}∥}_{2}}),

(16)

where

\arccos (\cdot)

denotes the arccos function, and the

〈\cdot, \cdot〉

denotes the dot product operation. The

{∥\cdot∥}_{2}

is

l_{2}

norm.

In general, higher values of PSNR and SSIM indicate better reconstruction results, while a lower value of SAM indicates better reconstruction.

3.3. Implementation Details

Taking into account the trade-off between performance and complexity, we set the number of layers to 4 (L = 4) in the transformer and CNN branches and the channel number S for the extracted shallow features

F_{S F}

to 96. For transformer branch, each transformer module consists of two grouping-aggregation transformers. We set the number of heads for self-attention in each layer to 2. The number of groups k for grouping self-attention is set to 4, and the aggregation rate

ξ

for aggregation self-attention is set to 0.25. For the CNN branch, we configure each CNN module to include one separable 3D convolution, two 2D convolution units, and a wide-spanning separable 3D attention block. In our experiments, we use

l_{1}

loss as the loss function. To optimize our model, we employ the AdamW optimizer [48] with a weight decay of 0.02. During the training process, we set the batch size to 16 with an initial learning rate of 5 ×

10^{- 4}

. The learning rate is halved every 35 epochs, and the total number of epochs is set to 100. All experiments are conducted using PyTorch on Nvidia 3090 GPUs.

3.4. Comparisons with State-of-the-Arts

To validate the effectiveness of our method, we compare it with six state-of-the-art (SOTA) methods, which are SSPSR [49], MCNet [24], ERCSR [25], GELIN [50], and MDFF [27].

As shown in Table 1, we present quantitative experiments with scale factors

\times

2,

\times

4, and

\times

8. Compared with state-of-the-art methods, we have significant advantages at each scale on all datasets. Specifically, for the CAVE dataset with a scale factor

\times

2, our method achieves a 0.532 dB higher PSNR than the second-best MDFF. For the CAVE dataset with scale factor

\times

4, our method also outperforms the second-best ERCSR by 0.556 dB. Meanwhile, our method outperforms the second-best method, MDFF, by 0.310 dB on the CAVE dataset with scale factor

\times

8. For the Harvard dataset, our method achieves PSNRs of 45.804 dB, 40.420 dB, and 36.251 dB, respectively, for scale factors

\times

2,

\times

4, and

\times

8, which are higher than other methods. For the Foster dataset, our method still shows advantages. Especially at a scale factor

\times

4, our PSNR is 0.423 dB higher than the second-best method. At scale factor

\times

8, our PSNR exceeds the second-best method by 0.901 dB, which represents a significant improvement. For the evaluation metric SAM, our method achieves the best results on CAVE, Harvard, and Foster datasets with scale factors

\times

2,

\times

4, and

\times

8. Taking the scale factor

\times

4 as an example, our SAM is 0.084 lower than the second-best method on the CAVE dataset. On the Harvard dataset, it is 0.017 lower than the second-best method. Additionally, on the Foster dataset, our SAM is 0.109 lower than the second-best method. These results demonstrate that our method can accurately recover spectral curves. For more quantitative results of the experiments, please refer to Table 1. In summary, from a quantitative perspective, our method possesses significant advantages. It outperforms the other comparative methods in various metrics, demonstrating clear superiority across all evaluation criteria.

As shown in Figure 7, Figure 8 and Figure 9, we present the mean absolute errors of the reconstructed images from the CAVE, Harvard, and Foster datasets. The mean error map provides a visual representation of the method’s performance. In the mean error map, darker colors indicate smaller errors, while brighter colors indicate larger errors. In Figure 7, Figure 8 and Figure 9, the mean error maps indicate that our proposed method achieves significantly smaller errors than other methods, particularly in high-frequency details where our method recovers more faithful textures. The bounding boxes highlight the regions where we have a clear advantage, with lower errors compared with other methods.

Specifically, Figure 7 consists of two peppers, and the challenge in its restoration lies in preserving the edges of the peppers and the color chart located above. From the mean error map, it is evident that our method exhibits the smallest error. The errors at the object edges are significantly smaller compared with other methods, indicating the superior performance of our approach. Furthermore, from the color chart above, it is evident that we have a clear advantage. The color chart restored by our method exhibits darker colors at the edges, while other methods tend to have brighter colors at the edges.

As shown in Figure 8, in our reconstruction result, there are more areas with lower brightness compared with other methods, indicating better performance. Additionally, our method demonstrates excellent reconstruction of the edge portions, indicating that our method excels at recovering high-frequency details.

Figure 9 illustrates the mean error map between the reconstructed result of hyperspectral image and its corresponding high-resolution image on the Foster dataset. We highlight the regions where we have significant advantages compared with other methods using bounding boxes. In these regions, our method exhibits lower brightness, indicating that we produce fewer errors in the reconstruction. Additionally, we achieve more faithful results at the edges of the objects, demonstrating that our method can better restore the details in the image.

Additionally, to demonstrate the precision of the spectral information restored by our method, we present the mean spectral difference curve. The average spectral difference curve vividly depicts the ability to recover spectral information. In Figure 10, we showcase the mean spectral difference curves obtained from the CAVE dataset. From Figure 10, it is evident that our curves remain in the lowest position consistently, indicating the smallest errors across all spectral bands in our reconstruction results. This advantage is obviously pronounced in Figure 10b.

In Figure 11, we present the mean spectral difference curves from the Harvard dataset. Our curves, once again, consistently remain in the lowest position, indicating that our method accurately recovers the spectral information of the hyperspectral images.

Floating-point operations (FLOPs) can accurately reflect the computational complexity of a model. In Table 2, we present a comparison of the FLOPs and PSNR for SSPSR, MCNet, ERCSR, GELIN, MDFF, and our HyFormer on the Harvard dataset with scale factor

\times

4. From the table, it can be seen that our HyFormer could achieve the highest PSNR with modest computational cost.

To further evaluate the performance on limited training samples, we conduct experiments on the CAVE (

\times

4) dataset by selecting 20% of the samples as our training set. As shown in Table 3, our method achieves the best performance in values of PSNR, SSIM, and SAM. Our PSNR is 0.41 dB higher than the second-best method, MDFF. This demonstrates that our method produces satisfactory performance even in the case of limited training samples.

3.5. Ablation Study

In this section, we conduct ablation studies to validate the effectiveness of our proposed HyFormer.

Effectiveness of the Grouping-Aggregation transformer Module. In our HyFormer, we devise a hybrid architecture incorporating a transformer branch and a CNN branch. The transformer branch employs a novel grouping-aggregation transformer (GAT), which effectively captures intra-spectra interactions of object-specific contextual information from a spectral perspective. In this part, we delve into an investigation of the transformer branch’s efficacy and evaluate its performance.

As shown in Table 4, the model with GAT achieves a PSNR improvement of 0.203 dB compared with the model without GAT (i.e., only consisting of the CNN branch). This significant improvement indicates that incorporating GAT enhances the model’s ability to model spectral information. Additionally, from the parameters listed in the second column of Table 4, it can be observed that the inclusion of GAT does not significantly increase the number of parameters, indicating that the method is practical.

In the GAT, we employ grouping-aggregation self-attention, comprising grouping self-attention (GSA) and aggregation self-attention (ASA).

The GSA effectively captures fine-grained details from different channels, while the ASA facilitates the interaction of texture information across channels. They complement each other, enabling a more comprehensive modeling capability. To validate the effectiveness of grouping-aggregation self-attention, we substitute it with traditional channel self-attention (CSA) to evaluate the performance. As shown in Table 5, the model with CSA has higher parameters, but surprisingly, it yields a lower PSNR value. Therefore, this demonstrates the effectiveness of our grouping-aggregation self-attention.

Effectiveness of Wide-Spanning Separable 3D Attention. A larger receptive field allows the model to capture information from a broader region. This is crucial for hyperspectral super-resolution tasks, as the spatial domain of hyperspectral images is often sparse. The larger receptive field can encompass more useful information, aiding the model in understanding the context and improving its performance in capturing fine details. However, the current mainstream convolutions only employ a kernel size of 3, which limits the receptive field of the model. Thus, we propose wide-spanning separable 3D attention (WSSA) to address this issue. Our WSSA effectively increases the receptive field while keeping parameter growth relatively low. It achieves this by stacking a set of specifically designed small-kernel 3D convolutions to mimic the receptive field of a large-kernel 3D convolution. Furthermore, it utilizes separable 3D convolutions to approximate the small-kernel 3D convolutions, resulting in smaller parameters.

As evident from Table 6, the inclusion of WSSA improves the PSNR value of the model by 0.14 dB, while the parameters only increase by 0.064 M. This demonstrates the effectiveness of our WSSA approach. It successfully improves the model’s performance while maintaining a minimal increase in the number of parameters.

Influence of Hyperparameters. We conduct experiments to test the influence of various important hyperparameters (number of layers L, number of groups k, and aggregation rate

ξ

) on the super-resolution performance. From Table 7, it can be observed that increasing the number of layers L improves the results in PSNR, SSIM, and SAM. However, an increase in the number of layers L also introduces a larger number of parameters. Therefore, to strike a balance between parameters and performance, we use L = 4 in our experiments. As shown in Table 8, the best performance is achieved when the number of groups

k = 4

in grouping self-attention. Thus, in our experiments, we configure the number of groups as

k = 4

. The effect of different aggregation rates

ξ

is shown in Table 9. We set

ξ

to 0.25 to strike a balance between the number of parameters and various performance metrics.

4. Conclusions

This paper introduces a method called HyFormer to improve the super-resolution of hyperspectral images, which incorporates three novel designs: (1) To capture intra-spectra interactions of object-specific contextual information, we propose a grouping-aggregation self-attention (GASA) mechanism for transformer. In GASA, grouping self-attention aims to extract diverse fine-grained features of targets, while aggregation self-attention enables information interaction among different channels. (2) To increase the receptive field to explore inter-spectra feature extraction, we introduce wide-spanning separable 3D Attention, which enlarges the receptive field with a low parameter number to better understand the contextual information in a hyperspectral image. (3) We design a hybrid architecture to adaptively fuse features extracted by transformer and CNN modules, leveraging their complementary strengths, thereby further enhancing the performance of hyperspectral super-resolution. Extensive experiments demonstrate the effectiveness of our proposed method.

There are also some barriers that need to be further improved. For example, it is worth mentioning that irregular shapes of objects in hyperspectral images often induce significant challenges to feature extraction with CNNs and transformers. The varying scales of targets also bring difficulties to the reconstruction. In our future research, we will focus on studying how to address the above issues to further improve the performance of hyperspectral image super-resolution. Additionally, considering the real-world application scenarios, we prepare to explore super-resolution methods that allow arbitrary scaling factors. The robustness of adapting to serious degradation is also a promising research direction in the future.

Author Contributions

Data curation, Y.J., J.S., Y.Z. (Yaping Zhang), H.Y. and Y.Z. (Yuan Zong); Formal analysis, Y.J., Y.Z. (Yaping Zhang), H.Y., Y.Z. (Yuan Zong) and L.X.; Funding acquisition, J.S.; Investigation, Y.Z. (Yaping Zhang), H.Y. and L.X.; Methodology, Y.J. and J.S.; Project administration, J.S.; Resources, J.S., Y.Z. (Yuan Zong) and L.X.; Software, Y.J., Y.Z. (Yaping Zhang) and H.Y.; Validation, Y.J. and Y.Z. (Yaping Zhang); Visualization, Y.J., J.S. and H.Y.; Writing—original draft, Y.J.; Writing—review & editing, Y.J., J.S., Y.Z. (Yuan Zong) and L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62002283, 62311530046, 62066020), the Key Research and Development Program of Shaanxi (Grant No. 2022GXLH-01-24) and the Fundamental Research Funds for the Central Universities (Grant No. xhj032021017).

Data Availability Statement

The datasets presented in this study are available public datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

Backman, V.; Wallace, M.B.; Perelman, L.; Arendt, J.; Gurjar, R.; Müller, M.; Zhang, Q.; Zonios, G.; Kline, E.; McGillican, T.; et al. Detection of preinvasive cancer cells. Nature 2000, 406, 35–36. [Google Scholar] [CrossRef] [PubMed]
Lu, G.; Fei, B. Medical hyperspectral imaging: A review. J. Biomed. Opt. 2014, 19, 010901. [Google Scholar] [CrossRef]
Kim, M.H.; Harvey, T.A.; Kittle, D.S.; Rushmeier, H.; Dorsey, J.; Prum, R.O.; Brady, D.J. 3D imaging spectroscopy for measuring hyperspectral patterns on solid objects. ACM Trans. Graph. 2012, 31, 38. [Google Scholar] [CrossRef]
Sabins, F.F. Remote sensing for mineral exploration. Ore Geol. Rev. 1999, 14, 157–183. [Google Scholar] [CrossRef]
Ji, Y.; Jiang, P.; Guo, Y.; Zhang, R.; Wang, F. Self-paced collaborative representation with manifold weighting for hyperspectral anomaly detection. Remote Sens. Lett. 2022, 13, 599–610. [Google Scholar] [CrossRef]
Lowe, A.; Harrison, N.; French, A.P. Hyperspectral image analysis techniques for the detection and classification of the early onset of plant disease and stress. Plant Methods 2017, 13, 80. [Google Scholar] [CrossRef]
Dian, R.; Li, S.; Guo, A.; Fang, L. Deep hyperspectral image sharpening. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5345–5355. [Google Scholar] [CrossRef]
Zheng, Y.; Li, J.; Li, Y.; Guo, J.; Wu, X.; Chanussot, J. Hyperspectral pansharpening using deep prior and dual attention residual network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8059–8076. [Google Scholar] [CrossRef]
Liu, J.; Wu, Z.; Xiao, L.; Sun, J.; Yan, H. A truncated matrix decomposition for hyperspectral image super-resolution. IEEE Trans. Image Process. 2020, 29, 8028–8042. [Google Scholar] [CrossRef]
Dian, R.; Li, S.; Fang, L. Learning a low tensor-train rank representation for hyperspectral image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2672–2683. [Google Scholar] [CrossRef]
Akgun, T.; Altunbasak, Y.; Mersereau, R.M. Super-resolution reconstruction of hyperspectral images. IEEE Trans. Image Process. 2005, 14, 1860–1875. [Google Scholar] [CrossRef]
Li, J.; Yuan, Q.; Shen, H.; Meng, X.; Zhang, L. Hyperspectral image super-resolution by spectral mixture analysis and spatial–spectral group sparsity. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1250–1254. [Google Scholar] [CrossRef]
Wang, Y.; Chen, X.; Han, Z.; He, S. Hyperspectral image super-resolution via nonlocal low-rank tensor approximation and total variation regularization. Remote Sens. 2017, 9, 1286. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef]
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
Shi, J.; Liu, X.; Zong, Y.; Qi, C.; Zhao, G. Hallucinating face image by regularization models in high-resolution feature space. IEEE Trans. Image Process. 2018, 27, 2980–2995. [Google Scholar] [CrossRef] [PubMed]
Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207, Part XII 16. [Google Scholar]
Shi, J.; Zhao, G. Face hallucination via coarse-to-fine recursive kernel regression structure. IEEE Trans. Multimed. 2019, 21, 2223–2236. [Google Scholar] [CrossRef]
Tian, C.; Zhang, Y.; Zuo, W.; Lin, C.W.; Zhang, D.; Yuan, Y. A heterogeneous group CNN for image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–13. [Google Scholar] [CrossRef]
Zhou, L.; Cai, H.; Gu, J.; Li, Z.; Liu, Y.; Chen, X.; Qiao, Y.; Dong, C. Efficient image super-resolution using vast-receptive-field attention. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 256–272, Part II. [Google Scholar]
Liu, T.; Cheng, J.; Tan, S. Spectral Bayesian Uncertainty for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18166–18175. [Google Scholar]
Mei, S.; Yuan, X.; Ji, J.; Zhang, Y.; Wan, S.; Du, Q. Hyperspectral image spatial super-resolution via 3D full convolutional neural network. Remote Sens. 2017, 9, 1139. [Google Scholar] [CrossRef]
Yang, J.; Zhao, Y.Q.; Chan, J.C.W.; Xiao, L. A multi-scale wavelet 3D-CNN for hyperspectral image super-resolution. Remote Sens. 2019, 11, 1557. [Google Scholar] [CrossRef]
Li, Q.; Wang, Q.; Li, X. Mixed 2D/3D convolutional network for hyperspectral image super-resolution. Remote Sens. 2020, 12, 1660. [Google Scholar] [CrossRef]
Li, Q.; Wang, Q.; Li, X. Exploring the relationship between 2D/3D convolution for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8693–8703. [Google Scholar] [CrossRef]
Zhang, J.; Shao, M.; Wan, Z.; Li, Y. Multi-scale feature mapping network for hyperspectral image super-resolution. Remote Sens. 2021, 13, 4180. [Google Scholar] [CrossRef]
Li, Q.; Yuan, Y.; Wang, Q. Hyperspectral image super-resolution via multi-domain feature learning. Neurocomputing 2022, 472, 85–94. [Google Scholar] [CrossRef]
Tang, Z.; Xu, Q.; Wu, P.; Shi, Z.; Pan, B. Feedback Refined Local-Global Network for Super-Resolution of Hyperspectral Imagery. Remote Sens. 2022, 14, 1944. [Google Scholar] [CrossRef]
Zhang, J.; Zheng, R.; Chen, X.; Hong, Z.; Li, Y.; Lu, R. Spectral Correlation and Spatial High–Low Frequency Information of Hyperspectral Image Super-Resolution Network. Remote Sens. 2023, 15, 2472. [Google Scholar] [CrossRef]
Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2016, 29, 4905–4913. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
Shi, J.; Wang, Y.; Dong, S.; Hong, X.; Yu, Z.; Wang, F.; Wang, C.; Gong, Y. Idpt: Interconnected dual pyramid transformer for face super-resolution. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Messe Wien, Austria, 23–29 July 2022; pp. 1306–1312. [Google Scholar]
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
Shi, J.; Wang, Y.; Yu, Z.; Li, G.; Hong, X.; Wang, F.; Gong, Y. Exploiting Multi-scale Parallel Self-attention and Local Variation via Dual-branch transformer-CNN Structure for Face Super-resolution. IEEE Trans. Multimed. 2023, 1–14. [Google Scholar] [CrossRef]
Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 280–296, Part IX. [Google Scholar]
Li, J.; Yu, Z.; Shi, J. Learning motion-robust remote photoplethysmography through arbitrary resolution videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1334–1342. [Google Scholar]
Pan, Z.; Cai, J.; Zhuang, B. Fast vision transformers with hilo attention. arXiv 2022, arXiv:2205.13213. [Google Scholar]
Ji, Y.; Jiang, P.; Shi, J.; Guo, Y.; Zhang, R.; Wang, F. Information-Growth Swin transformer Network for Image Super-Resolution. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3993–3997. [Google Scholar]
Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
Yasuma, F.; Mitsunaga, T.; Iso, D.; Nayar, S.K. Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum. IEEE Trans. Image Process. 2010, 19, 2241–2253. [Google Scholar] [CrossRef] [PubMed]
Chakrabarti, A.; Zickler, T. Statistics of real-world hyperspectral images. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 193–200. [Google Scholar]
Nascimento, S.M.; Amano, K.; Foster, D.H. Spatial distributions of local illumination color in natural scenes. Vis. Res. 2016, 120, 39–44. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Jiang, J.; Sun, H.; Liu, X.; Ma, J. Learning spatial-spectral prior for super-resolution of hyperspectral imagery. IEEE Trans. Comput. Imaging 2020, 6, 1082–1096. [Google Scholar] [CrossRef]
Wang, X.; Hu, Q.; Jiang, J.; Ma, J. A Group-Based Embedding Learning and Integration Network for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5541416. [Google Scholar] [CrossRef]

Figure 1. Each spectral band of a hyperspectral image exhibits significant sparsity in the spatial domain, with many regions lacking meaningful information.

Figure 2. Overall architecture of our HyFormer, which consists of two branches: a transformer branch to enable intra-spectra interaction to capture fine-grained contextual details and a CNN branch for facilitating efficient inter-spectra feature extraction.

Figure 3. The grouping-aggregation self-attention in GAT, which consists of two parts: grouping self-attention and aggregation self-attention.

Figure 4. The structure of our grouping-aggregation transformer. The GSA denotes grouping self-attention and the ASA denotes aggregation self-attention.

Figure 5. The structure of our wide-spanning CNN module.

Figure 6. Simulating the process of large receptive field using small-kernel 3D convolutions.

Figure 7. Mean error maps of hyperspectral image (real_and_fake_peppers_ms) from CAVE dataset with scale factor

\times