EL-NAS: Efficient Lightweight Attention Cross-Domain Architecture Search for Hyperspectral Image Classification

Wang, Jianing; Hu, Jinyu; Liu, Yichen; Hua, Zheng; Hao, Shengjia; Yao, Yuqiong

doi:10.3390/rs15194688

by Jianing Wang ¹,*,†, Jinyu Hu ²,†, Yichen Liu ², Zheng Hua ², Shengjia Hao ² and Yuqiong Yao ²

¹ Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, School of Computer Science and Technology, Xidian University, No. 2 South TaiBai Road, Xi’an 710071, China

² School of Artificial Intelligence, Xidian University, No. 2 South TaiBai Road, Xi’an 710071, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2023, 15(19), 4688; https://doi.org/10.3390/rs15194688

Submission received: 30 July 2023 / Revised: 10 September 2023 / Accepted: 11 September 2023 / Published: 25 September 2023

(This article belongs to the Special Issue Advanced Artificial Intelligence and Deep Learning for Remote Sensing II)

Abstract

Deep learning (DL) algorithms have achieved important breakthroughs in hyperspectral image (HSI) classification. Despite this success, manually designed DL structures of increasing depth and size are difficult to deploy on mobile and embedded devices in real applications. To tackle this issue, we propose an efficient lightweight attention network architecture search algorithm (EL-NAS) that automatically designs lightweight DL structures while improving HSI classification performance. First, to realize an efficient search procedure, we construct EL-NAS on a differentiable neural architecture search (NAS), which greatly accelerates the convergence of the over-parameterized supernet through gradient descent. Second, to obtain lightweight search results with high accuracy, a lightweight attention module search space is designed for EL-NAS. Finally, to alleviate the problem of high validation accuracy during the search but poor final classification performance, an edge decision strategy makes edge decisions through the entropy of the distribution estimated over non-skip operations, which avoids the performance collapse caused by numerous skip operations. To verify the effectiveness of EL-NAS, we conducted experiments on several real-world hyperspectral images. The results demonstrate that EL-NAS achieves a more efficient search procedure with smaller parameter sizes and high accuracy for HSI classification, even under data-independent and sensor-independent scenarios.

Keywords: hyperspectral image; classification; lightweight; attention mechanism; neural architecture search

1. Introduction

A hyperspectral remote sensing image (HSI) can be regarded as a 3D cube that reflects the spatial, spectral, and radiation information of materials and land cover. Because of its abundant spectral bands, HSI data is of great importance in many practical applications, such as agriculture [1], environmental science [2], urban remote sensing [3], military defense [4], and other fields [5]. Various methods have been proposed to exploit this rich spectral information for HSI classification. Traditional representative algorithms, such as sparse representation classification (SRC) [6], collaborative representation classification (CRC) [7], SVM with a nonlinear kernel projection method [8], and other kernel-based methods [9,10], use the rich discriminant spectral information. Spectral–spatial (SS) algorithms were then gradually proposed to improve classification performance by incorporating and fusing spatial correlation information. By introducing spatial features into HSI classification, a spatial–spectral derivative-aided kernel joint sparse representation (KJSR-SSDK) [11] and an adaptive nonlocal spatial–spectral kernel (ANSSK) [12] were proposed for extracting combined SS features. The above-mentioned algorithms rely mainly on handcrafted features (spectral information or dimension-reduced spectral information) combined with classifiers for HSI classification.

Inspired by the success of DL techniques in computer vision, DL-based methods show great promise because they can discover distributed feature representations of data by combining low-level features into more abstract high-level representations [13]. Spatial–spectral (SS)-based DL methods have shown promising performance for HSI classification. 2D-CNN-based methods have been proposed for extracting SS features, although they first require dimension reduction of the HSI data [14,15,16]. Because the input data in HSI processing is a 3D cube, a multi-scale 3D-CNN that uses filters of different sizes was introduced in [17]. By combining ResNet with an SS CNN, the SS residual network (SSRN) was introduced to learn robust SS features from HSI [15]. The Residual Conv–Deconv Network (RCDN) densely connects deep residual networks [18]. Meanwhile, the deep feature fusion network (DFFN) was proposed to alleviate the overfitting and vanishing-gradient problems of CNNs by exploiting the strong complementary correlation between different layers of the network [19]. A deep generative spectral–spatial classifier (DGSSC) was proposed to address imbalanced HSI classification [20]. Zhang et al. further proposed a deep 3D lightweight convolutional network consisting of dozens of 3D convolutional layers to improve classification performance [21]. However, CNN-based algorithms are liable to lose local information because of their pooling layers. To cope with this problem, a dual-channel capsule network with GAN (DcCapsGAN) was proposed, which generates pseudo-samples more efficiently and improves classification accuracy [22]. Additionally, a quaternion transformer network (QTN) for recovering self-adaptive and long-range correlations in HSIs was proposed in [23].
The Lightweight SS Attention Feature Fusion Framework (LMAFN) [24] is constructed according to architectural guidelines provided by NAS [25] and achieves commendable classification accuracy with a reduced number of parameters. Specifically, LMAFN is a manually designed neural network that uses architectural principles derived from NAS to guide its feature fusion and network architecture. The entire LMAFN network is therefore built by hand following NAS-derived rules and does not use an automated architecture search.

Nonetheless, deep learning for hyperspectral image classification faces several challenges, including model complexity, burdensome architectural design, and the inherent scarcity of labeled hyperspectral data. Together, these factors impede the training efficacy and generalization ability of deep learning models. As a practical solution, transfer learning helps improve model performance when little data is available by transferring useful knowledge from a source domain, where plenty of data exists, to a target domain that lacks data. Deep convolutional recurrent neural networks with transfer learning [26] extract spatial–spectral features even when the number of training samples is limited. HT-CNN [27] proposes a heterogeneous transfer learning approach that adjusts for the differences between heterogeneous datasets through an attention mechanism. TL-ELM [28] introduces an ensemble transfer learning algorithm built upon Extreme Learning Machines. This approach not only preserves the input weights and hidden biases acquired from the target domain, but also iteratively fine-tunes the output weights using instances from the source domain.

Limited storage resources and power budgets, together with the computational complexity and parameter size of DL models, hinder the deployment of DL-based algorithms in practical applications, especially on edge devices and embedded platforms. How to realize lightweight and automated architecture design under limited storage and power constraints has therefore become a crucial issue [29,30]. MobileNetV3 [31] efficiently combines depthwise (DW) separable convolution, the inverted residual, and SE attention modules. Furthermore, EfficientNetV2 [32] and SqueezeNet [33] involve attention modules and lightweight structures to improve classification performance efficiently. Nevertheless, the above-mentioned algorithms are mainly designed by hand for specific tasks. In real applications, such manual design is inherently difficult and time-consuming and relies heavily on expert knowledge. As research becomes more complex, the cost of tuning the model parameters of deep networks increases dramatically.

The Neural Architecture Search (NAS) approach addresses the difficulty of manually designing efficient, lightweight architectures for edge devices. There are three main streams in the NAS literature: reinforcement-learning-based (RL-based), evolutionary-learning-based (EL-based), and gradient-based (GD-based) approaches. RL-based NAS iteratively generates new architectures by learning to maximize a reward derived from an objective (e.g., the accuracy on the validation set or the model latency) [34,35,36]. In EL-based NAS, architectures are represented as individuals in a population. Individuals with high fitness scores (validation accuracy) are privileged to generate offspring, which replace individuals with low fitness scores. Large-Scale Evolution, proposed in [37], applied evolutionary algorithms to discover architectures for the first time. Hier-Evolution [38] combines a new hierarchical genetic representation scheme with an expressive search space supporting complex topologies and outperforms various manually designed architectures on image classification tasks. However, most RL-based and EL-based NAS methods demand substantial computation. For instance, the RL-based NASNet [39] required 450 GPUs for 4 days (1800 GPU-days), and MnasNet [40] used 64 TPUs for 4.5 days on CIFAR-10. Similarly, the EL-based Hier-Evolution [38] needs 300 GPU-days to find a satisfactory architecture on CIFAR-10. RL-based and EL-based NAS methods treat architecture search over a discrete space as a black-box optimization problem, which entails an excessive number of architecture performance evaluations.

In contrast to RL-based and EL-based NAS, the GD-based NAS approach continuously relaxes the original discrete search space, making it possible to optimize the architectural search space efficiently by gradient descent. Following the cell-based search space of NASNet and exploring the possibility of transforming the discrete neural architecture space into a continuously differentiable form, DARTS [41] introduces an architecture parameter for each path and jointly trains the weights and architecture parameters via gradient descent, which makes the architecture search considerably more efficient.

Motivated by the above-mentioned problems and literature, we construct an efficient lightweight attention architecture search (EL-NAS) for HSI classification in this paper. First, because of the efficiency of GD-based architecture search, we adopt differentiable neural architecture search as the main automatic DL design strategy to realize an efficient search procedure. Considering real applications on mobile and embedded devices, the lightweight module, the attention module, and 3D decomposition convolution are jointly exploited to construct the search space, which can efficiently improve classification accuracy with lower computation and storage costs. Meanwhile, to mitigate the performance collapse caused by the number of skip operations in the search procedure, an edge decision strategy and a dynamic regularization are designed based on the entropy and distribution of the non-skip operations to preserve the most efficient searched structure. Furthermore, a generalization loss is introduced to improve the generalization of the searched model.

The main contributions and innovations of the proposed EL-NAS are summarized as follows:

EL-NAS introduces the lightweight module, the attention module, and 3D decomposition convolution to automatically realize an efficient design of DL structures for hyperspectral image classification. The efficient automatic search strategy thereby enables a task-driven automatic design of DL structures for different datasets from different acquisition sensors or scenarios.
EL-NAS achieves remarkable search efficiency and a lightweight attention DL structure through its edge decision strategy: (i) knowledge of successful lightweight 3D decomposition convolution and attention modules is built into the search space; (ii) the entropy of the operation distribution estimated over non-skip operations is used to make the edge decision; and (iii) a dynamic regularization loss based on the impact of the number of skip connections further improves search performance. As a result, the most effective and lightweight operations are preserved by the edge decision strategy.
Compared with several state-of-the-art methods in comprehensive experiments on accuracy, classification maps, the number of parameters, and execution cost, EL-NAS requires a lower GPU search cost, fewer parameters, and less computation. The experimental results on three real HSI datasets demonstrate that EL-NAS can find a more lightweight network structure and achieve more robust classification results, even under data-independent and sensor-independent scenarios.

The rest of this article is organized as follows. Section 2 reviews the related works in HSI classification. The details of the proposed EL-NAS are described in Section 3. The experiments and their analysis are presented and discussed in Section 4. Finally, the conclusions are summarized in Section 5.

2. Related Work

2.1. GD-Based NAS

The search space, the search strategy, and the performance evaluation are the three main aspects of GD-based NAS that can be improved. Compared to DARTS, ACA-DARTS [42] removes skip connections from the operation space and introduces an adaptive channel allocation strategy to refill the skip connections in the evaluation stage. PAD-NAS [43] can automatically design the operations for each layer and achieve a trade-off between search space quality and model diversity. In [44], a new search space is introduced based on the backbone of the convolution-enhanced transformer (Conformer), a more expressive architecture than the ASR architecture used in existing NAS-based ASR frameworks. For the re-identification (ReID) task, CDNet [45] is proposed based on a novel search space called the combined depth space (CDS). By using the combined basic building blocks in the CDS, CDNet tends to focus on the combined pattern information normally found in pedestrian images. For the hyperspectral image classification task, 3D-ANAS introduces a three-dimensional asymmetric decomposition search space, which improves classification efficiency [46]. A novel hybrid search space is also proposed in [47], where 3D convolution, 2D spatial convolution, and 2D spectral convolution are employed.

Recent research has increasingly focused on how to avoid the well-known performance collapse caused by the inevitable aggregation of skip connections and how to mitigate the drawbacks of weight sharing in DARTS. DARTS+ [48] leverages early stopping to avoid the performance collapse in DARTS. PC-DARTS [49] exploits the redundancy of the network space by sampling parts of the supernet for a more efficient search. DARTS- [50] offsets the advantages of a skip connection with an auxiliary skip connection, ensuring more equitable competition among all operations. SGAS [51] partitions the search process into sub-problems to select and greedily reduce candidate operations. FairNAS [52] proposes strict fairness, where each iteration of the supernet has to train the parameters of each candidate module at each layer, which guarantees that all candidate modules have equal opportunities for optimization throughout the training process. Single-DARTS [53] updates the network weights and the structural parameters simultaneously on the same batch of data instead of using bi-level optimization, which significantly alleviates performance collapse and improves the stability of the architecture search. Zela et al. [54] demonstrated that various types of regularization can improve the robustness of DARTS and help it find solutions with better generalization properties. Additionally, β-DARTS [55] proposes a simple but efficient Beta-Decay regularization method to regularize the DARTS-based NAS search process. U-DARTS [56] redesigns the search space and combines it with sampling and parameter-sharing strategies, where the regularization method considers depth and complexity to prevent network deterioration. FP-DARTS [57] constructs two over-parameterized sub-networks that form a two-way parallel hypernetwork by introducing dichotomous gates to control whether the paths are involved in the training of the hypernetwork.

2.2. NAS for HSI

Realizing state-of-the-art (SOTA) neural networks for a specific task through manual design usually requires tedious effort and substantial expert knowledge. PSO-Net [58] is based on Particle Swarm Optimization (PSO) and an evolutionary search method for hyperspectral image data, enabling accelerated convergence. CPSO-Net [59] presents a more efficient continuous evolutionary approach that can speed up the generation of architectures as weight-sharing parameters are optimized. Inspired by DARTS, Chen et al. proposed an automatic CNN (Auto-CNN) [60] for HSI classification and introduced a regularization technique named cutout to improve the classification accuracy. Auto-CNN outperformed manually designed DL architectures while using fewer parameters. A hierarchical search space is proposed in 3D-ANAS [46], which searches both topology and network width and introduces a three-dimensional asymmetric decomposition search space with highly effective classification performance. A2S-NAS [61] proposes a multi-stage architectural search framework to overcome the asymmetric spatial dimension of the spectrum and capture important features. LMSS-NAS [62] proposes a lightweight multiscale NAS with spatial–spectral attention, which centers on a search space composed of lightweight convolutional operators and transfers the label smoothing loss into NAS to ameliorate the problem of unbalanced samples.

Building on the efficient search strategy of DARTS, we additionally adopt an edge decision algorithm to alleviate performance collapse. Because lightweight modules and attention modules can improve model performance with fewer parameters, we also use them to construct EL-NAS and improve the generalization of the searched model, which makes it more feasible to apply HSI classification methods on edge devices.

3. Methodology

In this section, we introduce the procedure of the proposed EL-NAS in detail. The overall workflow is illustrated in Figure 1. The modular search space fully exploits the characteristics of hyperspectral data and constructs an over-parameterized supernet, combining the lightweight module and attention mechanisms to obtain models with high performance and generalization. The regularization-based edge-decision search strategy uses an edge-decision algorithm that significantly accelerates the convergence of the over-parameterized supernet and mitigates the drawbacks of weight sharing, which efficiently limits the unfair competition of skip connections in the search process. The performance evaluation defines the metrics that guide the NAS toward high-performance and highly generalizable models.

3.1. Modular Search Space

To obtain more efficient and compact search results for the hyperspectral image classification task with NAS, we propose a modular search space that differs from that of DARTS [41]. In the modular search space, modules rather than simple convolutions are regarded as candidate operations, which exploits the experience of manual architecture design and ensures the stability of the search results.

The whole modular search space is illustrated in Figure 2. Each cell is a DAG (Directed Acyclic Graph) consisting of an ordered sequence of $N$ nodes (including two input nodes, $N - 3$ intermediate nodes, and one output node), where each node $x^{(i)}$ represents a feature map in the network and $i$ is its order in the DAG. Each directed edge $e_{i, j}$ connects node $x^{(i)}$ to node $x^{(j)}$ and is associated with an operation $o^{(i, j)}$ that transforms $x^{(i)}$ into $x^{(j)}$. Meanwhile, each intermediate node is connected with all its predecessors. The set of edges $E$ is formulated as

$$ E = \{ e_{i, j} \mid 0 \leq i \leq j, \; 1 < j < N + 2 \} \tag{1} $$

Each edge $e_{i, j}$ contains all candidate operations (alias paths), and DARTS transforms the discrete operation options into a differentiable parameter optimization problem by continuously relaxing the outputs of the different operations through a set of learnable architectural parameters $\alpha$. For example, the mixed operation $\bar{o}^{(i, j)}$ taking the feature map $x^{(i)}$ as input can be represented as follows:

$$ \bar{o}^{(i, j)}(x^{(i)}) = \sum_{o \in O} \frac{\exp(\alpha_{o}^{(i, j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i, j)})} \, o(x^{(i)}) \tag{2} $$

where $O$ is the set of candidate operations (e.g., $\mathrm{op}_1$, $\mathrm{op}_2$, …, $\mathrm{zero}$), and $o$ is a specific operation applied to $x^{(i)}$ and parameterized by the architecture parameter $\alpha_{o}^{(i, j)}$. Each intermediate node $x^{(j)}$ is computed by the following formula:

$$ x^{(j)} = \sum_{i < j} \bar{o}^{(i, j)}(x^{(i)}) \tag{3} $$

The input nodes in the DAG are represented by the output of the previous convolution and previous two cells, and the output nodes are represented by concatenating all intermediate nodes.
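As a minimal sketch of Equations (2) and (3), the snippet below relaxes the choice among three toy operations with a softmax over architecture parameters. The operation names (`double`, `skip`, `zero`), the scalar stand-ins for feature maps, and all helper names are illustrative assumptions, not the EL-NAS search space itself.

```python
import math

# Hypothetical toy candidate operations acting on a scalar "feature map".
CANDIDATE_OPS = {
    "double": lambda x: 2.0 * x,
    "skip": lambda x: x,
    "zero": lambda x: 0.0,
}


def softmax(alphas):
    """Softmax over a dict of architecture parameters (the weights in Eq. (2))."""
    m = max(alphas.values())  # subtract the max for numerical stability
    exps = {name: math.exp(a - m) for name, a in alphas.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def mixed_op(x, alphas):
    """Eq. (2): softmax-weighted sum of all candidate operations on one edge."""
    weights = softmax(alphas)
    return sum(weights[name] * op(x) for name, op in CANDIDATE_OPS.items())


def intermediate_node(inputs, edge_alphas):
    """Eq. (3): sum the mixed operations over all predecessor nodes."""
    return sum(mixed_op(x, alphas) for x, alphas in zip(inputs, edge_alphas))


uniform = {"double": 0.0, "skip": 0.0, "zero": 0.0}
# With equal alphas each operation has weight 1/3: (2*3 + 3 + 0) / 3 = 3.
y_edge = mixed_op(3.0, uniform)
# Two predecessors (x = 3 and x = 6) feeding one intermediate node: 3 + 6 = 9.
y_node = intermediate_node([3.0, 6.0], [uniform, uniform])
```

Raising one entry of `alphas` shifts the mixed output toward that operation, which is what lets gradient descent express a preference among the candidates.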

We draw on experience in manual architecture design and use existing modules as the candidate operations of the search space, including a lightweight module, an attention module, and 3D decomposition convolution.

(1): Lightweight module (i.e., inverted residual block in MobileNetv2 [63], IR) involves pointwise convolution and depthwise separable convolution. The purpose of the inverted residual module is to increase the number of channels by pointwise convolution and then perform depthwise separable convolution in higher dimensions to extract better channel features without significantly increasing the model parameters and computational costs.
(2): Attention module (i.e., Squeeze-and-Excitation [64], SE) adaptively learns weights for different channels using global pooling and fully connected layers. Hundreds of spectral channels are a significant characteristic of hyperspectral images, and different channels contribute differently to the feature classification task, so the channel attention module is essential, as verified in the experimental section.
(3): 3D decomposition convolution. In this paper, 3D convolution is decomposed into two types of decomposition convolution for processing spectral and spatial information, respectively. The principle of 3D decomposition convolution is shown in Figure 3, where a 3D convolution with a kernel size of $C \times K \times K$ is decomposed into two decomposition convolutions with a kernel size of $C^{'} \times 1 \times 1$ and $1 \times K^{'} \times K^{'}$ , respectively. This simplifies the complexity of a single candidate operation and allows the search space to yield more possibilities of models, which can significantly reduce the model parameters.
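To make the parameter saving of item (3) concrete, the following sketch counts the weights of a single kernel before and after decomposition. It assumes $C' = C$ and $K' = K$, one input and one output channel, and no bias terms; the sizes 7 and 3 are illustrative and not taken from the paper.

```python
def full_3d_kernel_params(c, k):
    """Weights in one C x K x K 3D kernel (single input/output channel, no bias)."""
    return c * k * k


def decomposed_kernel_params(c, k):
    """Weights in a C x 1 x 1 spectral kernel plus a 1 x K x K spatial kernel."""
    return c * 1 * 1 + 1 * k * k


# Illustrative sizes only; assumes C' = C and K' = K.
full = full_3d_kernel_params(7, 3)       # 7 * 3 * 3 = 63
split = decomposed_kernel_params(7, 3)   # 7 + 9 = 16
ratio = split / full                     # fraction of weights that remain
```

Under these assumptions the decomposed pair keeps roughly a quarter of the weights, and the gap widens as $C$ or $K$ grows because the cost changes from a product to a sum.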

Therefore, our modular search space fully considers the characteristics of hyperspectral data and offers the following improvements: (1) We take patches as input to accelerate processing, so the down-sampling operation is not essential in this procedure. (2) To efficiently extract discriminative SS features, we include 3D decomposition convolution in the search space as a candidate operation. (3) With a well-designed attention module on the hyperspectral channels, our method can fully exploit spectral discriminant information. (4) The SS information of the HSI dataset is further extracted by the inverted residual module with fewer parameters.

3.2. Regularization-Based Edge-Decision Search Strategy

A key challenge after constructing the modular search space is how to search for the optimal architecture in a discrete search space in a differentiable manner. To this end, the softmax relaxation turns the search from selecting discrete candidate operations into optimizing the probabilities of continuous mixed operations.

3.2.1. Bi-Level Optimization

The network weights $\omega$ and the architecture parameters $\alpha$ are the two sets of parameters that need to be optimized: $\alpha$ denotes the weights of the different operations (paths) on all edges, while $\omega$ denotes the internal parameters of the operations. They are jointly optimized through the following bi-level optimization problem:

$$ \min_{\alpha} \; L_{val}(\omega^{*}(\alpha), \alpha) \tag{4} $$

$$ \text{s.t.} \quad \omega^{*}(\alpha) = \arg\min_{\omega} L_{train}(\omega, \alpha) \tag{5} $$

where $L_{val}$ and $L_{train}$ denote the validation and training losses, respectively. $\alpha$ is the upper-level variable, and $\omega$ is the lower-level variable. The optimal $\alpha$ is obtained by the above bi-level optimization formula, and then the final neural architecture is derived by discretization on chosen operations.

The discretization process is to select the operation $o^{(i, j)}$ with the highest weight on the directed edge $e_{i, j}$ and discard other operations:

$$ o^{(i, j)} = \arg\max_{o \in O} \alpha_{o}^{(i, j)} \tag{6} $$
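The discretization in Equation (6) can be sketched as a per-edge argmax. The edge indices, operation names (`ir`, `se`, `skip`), and parameter values below are hypothetical and serve only to illustrate the rule.

```python
def discretize(edge_alphas):
    """Eq. (6): on every edge, keep the operation with the largest alpha."""
    return {edge: max(alphas, key=alphas.get) for edge, alphas in edge_alphas.items()}


# Hypothetical architecture parameters for two edges of a cell.
edge_alphas = {
    (0, 2): {"ir": 0.9, "se": 0.4, "skip": 0.1},
    (1, 2): {"ir": -0.3, "se": 0.2, "skip": 0.7},
}
chosen = discretize(edge_alphas)
```

In this toy setting the second edge resolves to a skip connection simply because its parameter is largest, which is the kind of outcome that becomes problematic when it happens on many edges at once.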

The bi-level optimization process described above has the following problems: (1) In the later stage of the search, the number of skip connections in the selected architecture increases sharply, which is liable to cause performance degradation. (2) The weight sharing between subnets leads to inaccurate evaluation. Therefore, the edge decision criterion is exploited to alleviate these problems.

3.2.2. Edge Decision Criterion

The design of the selection criterion is crucial to guarantee that the best edge is chosen at each edge decision, i.e., that the optimality of the supernet is maintained. Two aspects of an edge are considered: edge importance and selection certainty.

Edge Importance

The skip operation on an important edge should have a low weight. Therefore, the edge importance is defined as the total softmax weight of the non-skip operations:

$$ S_{EI}^{(i, j)} = \sum_{o \in O, \, o \neq skip} \frac{\exp(\alpha_{o}^{(i, j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i, j)})} \tag{7} $$

Selection Certainty

Let $p_{o}^{(i, j)} = \frac{\exp(\alpha_{o}^{(i, j)})}{S_{EI}^{(i, j)} \sum_{o' \in O} \exp(\alpha_{o'}^{(i, j)})}$, with $o \in O$ and $o \neq skip$, denote the softmax weights of the non-skip operations, renormalized so that they sum to one. Selection certainty is defined as one minus the normalized entropy of the operation distribution $p_{o}^{(i, j)}$ and measures how certain the distribution is:

$$ S_{SC}^{(i, j)} = 1 - \frac{- \sum_{o \in O, \, o \neq skip} p_{o}^{(i, j)} \log(p_{o}^{(i, j)})}{\log(|O| - 1)} \tag{8} $$

Then, the edge importance $S_{EI}^{(i, j)}$ and the selection certainty $S_{SC}^{(i, j)}$ are normalized to calculate the final score, and the edge with the highest score is selected:

$$ S_{e}^{(i, j)} = \operatorname{normalize}(S_{EI}^{(i, j)}) \cdot \operatorname{normalize}(S_{SC}^{(i, j)}) \tag{9} $$

where $\operatorname{normalize}(\cdot)$ denotes standard Min-Max scaling. First, an edge $e_{i^{+}, j^{+}}$ is selected greedily according to the above edge decision criterion, i.e., $(i^{+}, j^{+}) = \arg\max_{(i, j)} S_{e}^{(i, j)}$. The corresponding mixed operation $\bar{o}^{(i^{+}, j^{+})}$ is replaced with the optimal operation via $o^{(i^{+}, j^{+})} = \arg\max_{o \in O} \alpha_{o}^{(i^{+}, j^{+})}$. The network weights and architectural parameters of the remaining paths within that mixed operation are no longer needed, so they are pruned progressively over the optimization iterations, which drastically improves the efficiency of the search procedure. The remaining over-parameterized supernet $S_{-}$ (including the remaining architecture parameters $A_{-}$ and weights $W_{-}$) forms a new subproblem, which is also defined on a DAG. The operations on the remaining edges are then selected iteratively by solving these subproblems. As a result, the validation accuracy better reflects the final evaluation accuracy, because the discrepancy between the searched and the final model is minimized in this procedure. Additionally, we require that each intermediate node eventually retain exactly two input edges. Once a node has two determined input edges, its other input edges are pruned.
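One greedy edge decision following Equations (7) to (9) might look like the sketch below. The operation names, the parameter values, and the tie handling in `min_max` are assumptions made for illustration; the sketch also assumes that every edge has a `skip` candidate and at least two non-skip candidates, so that $\log(|O| - 1) > 0$.

```python
import math

SKIP = "skip"  # assumed name of the skip-connection candidate


def softmax(alphas):
    m = max(alphas.values())
    exps = {o: math.exp(a - m) for o, a in alphas.items()}
    total = sum(exps.values())
    return {o: e / total for o, e in exps.items()}


def edge_importance(alphas):
    """Eq. (7): total softmax weight of the non-skip operations."""
    return sum(p for o, p in softmax(alphas).items() if o != SKIP)


def selection_certainty(alphas):
    """Eq. (8): one minus the normalized entropy over non-skip operations."""
    s_ei = edge_importance(alphas)
    probs = [p / s_ei for o, p in softmax(alphas).items() if o != SKIP]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(alphas) - 1)


def min_max(values):
    """Min-Max scaling; if all values tie, every edge gets 1.0 (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_edge(edge_alphas):
    """Eq. (9): pick the edge with the highest score, then its best operation."""
    edges = list(edge_alphas)
    ei = min_max([edge_importance(edge_alphas[e]) for e in edges])
    sc = min_max([selection_certainty(edge_alphas[e]) for e in edges])
    scores = {e: a * b for e, a, b in zip(edges, ei, sc)}
    best = max(scores, key=scores.get)
    best_op = max(edge_alphas[best], key=edge_alphas[best].get)
    return best, best_op, scores


# Hypothetical architecture parameters for three undecided edges.
edge_alphas = {
    (0, 2): {"ir": 2.0, "se": 0.0, "conv": 0.0, "skip": -1.0},  # confident, little skip
    (1, 2): {"ir": 0.0, "se": 0.0, "conv": 0.0, "skip": 2.0},   # skip-dominated, undecided
    (0, 3): {"ir": 0.5, "se": 0.3, "conv": 0.1, "skip": 0.0},   # nearly uniform
}
best, best_op, scores = select_edge(edge_alphas)
```

In a full search this decision would be repeated on the remaining supernet after further training, with the decided edge fixed and its unused paths pruned; that outer loop is omitted here.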

3.2.3. Dynamic Regularization (DR)

After each edge decision, a regularization term that accounts for the number of skip connections adjusts the skip architecture parameters through the regularity factor $\delta$. This approach effectively mitigates the unfair competition caused by skip connections. More precisely, the dynamic regularity is defined as

$$ \alpha_{skip}^{(i, j)}(x) = \delta \, \alpha_{skip}^{(i, j)}(x) \tag{10} $$

The performance of the architecture, viewed as a function of the number of skip connections, is observed to closely resemble a Gaussian curve. Consequently, the factor $\delta$ is modeled with a Gaussian function:

$$ \delta = a \, e^{- \frac{(n_{skip} - \mu)^{2}}{2 \sigma^{2}}} \tag{11} $$

where $n_{skip}$ is the number of skip connections in the selected operations, and $a$, $\mu$, and $\sigma$ are the parameters of the Gaussian function. Dynamic regularity ensures that the selection of skip connections is encouraged when there are too few skip connections and discouraged when a large number of skip connections is involved. As illustrated in Figure 4, a suitable number of skip connections keeps the architecture performance at its optimum and thus avoids the performance collapse caused by skip connections. The hyperparameters of the Gaussian function were estimated by fitting a Gaussian curve to the numerical results, e.g., $a = 0.81$, $\mu = 1.22$, $\sigma = 2.17$. The whole search workflow is illustrated in Figure 5, and an independent network without weight sharing is obtained.
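Equations (10) and (11) can be evaluated directly with the fitted values quoted above. The sketch below only computes $\delta$ and the rescaled skip parameter; when and how often the rescaling is applied during the search is not specified here, so the function names and the call pattern are assumptions.

```python
import math


def dynamic_regularity(n_skip, a=0.81, mu=1.22, sigma=2.17):
    """Eq. (11): Gaussian-shaped factor delta as a function of the skip count."""
    return a * math.exp(-((n_skip - mu) ** 2) / (2.0 * sigma ** 2))


def rescale_skip_alpha(alpha_skip, n_skip):
    """Eq. (10): scale the skip architecture parameter by delta."""
    return dynamic_regularity(n_skip) * alpha_skip


# delta for 0..6 skip connections already present in the selected operations.
deltas = [dynamic_regularity(n) for n in range(7)]
```

With these values $\delta$ is largest near $n_{skip} \approx \mu$ and decays as the count grows, so a positive skip parameter is damped most strongly once many skip connections have already been selected.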

3.3. Performance Evaluation

Designing a suitable loss function is a crucial task in the searching procedure for the optimal architecture. The cross-entropy loss

L_{CE}

measures the difference between the predicted value and the ground truth in the classification task. The loss is defined as

\begin{matrix} L_{CE} = - \frac{1}{n} \sum_{k = 1}^{n} (y_{k} \log {\hat{y}}_{k} + (1 - y_{k}) \log (1 - {\hat{y}}_{k})) \end{matrix}

(12)

where n denotes the number of samples,

y_{k}

is the ground-truth label of the given sample, and

{\hat{y}}_{k}

is the predicted label.
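As a quick numerical illustration of Equation (12), the sketch below evaluates the loss on plain Python lists. It assumes that the predicted value is a probability strictly between 0 and 1; the clamping constant `eps` and the function name `cross_entropy` are additions for this example and do not come from the paper.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Equation (12), averaged over n samples; y_pred holds probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A maximally uncertain prediction (0.5) costs log(2) per sample.
print(cross_entropy([1, 0], [0.5, 0.5]))
```

Confident correct predictions drive the loss toward zero, while confident wrong predictions are penalized heavily.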

The generalization ability of the model can be quantified regarding the differences between the training and evaluation metrics. For two models with similar training metrics, the model with better evaluation metrics is more generalizable because it can better predict an unknown dataset. The training dataset is taken in the searching phase to optimize the model weights. The validation dataset is only available for model searching. The generalization loss

L_{g}

is defined as the difference between the training metrics

L_{CE}^{\mathrm{train}}

and the validation metrics

L_{CE}^{\mathrm{val}}

, which measures the generalization ability of the model and is designed to guide the searching procedure.

\begin{matrix} L_{g} = | L_{CE}^{\mathrm{val}} - L_{CE}^{\mathrm{train}} | \end{matrix}

(13)

where

| \cdot |

is the absolute value. To further improve the robustness and generalization performance of the searched model, we integrate the Beta-Decay regularization [55]. Specifically, the Beta-Decay regularization imposes constraints that keep the values and variances of the activated architecture parameters from becoming too large.

\begin{matrix} L_{β} = \log (\sum_{k = 1}^{| O |} e^{α_{k}}) \end{matrix}

(14)

where

O

is the set of candidate operations and

α_{k}

is the architectural parameter associated with operation k. We construct the above losses in an automatic way as optimization objectives.

\begin{matrix} L = L_{CE} + λ_{1} L_{g} + λ_{2} L_{β} \end{matrix}

(15)

where

λ_{1}

and

λ_{2}

are the weighting hyperparameters; they are learnable and adaptively balance the contributions of the different losses. The whole procedure of the proposed EL-NAS is summarized in Algorithm 1.
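Equations (13) to (15) combine into a single scalar objective, which the following sketch evaluates for given loss values. It is an illustration under assumptions: the function names are hypothetical, the weights `lambda_1` and `lambda_2` are passed in as fixed numbers rather than learned, and the Beta-Decay term is computed for the architecture parameters of one edge only, since the text does not state how edges are aggregated.

```python
import math


def generalization_loss(ce_train, ce_val):
    """Equation (13): absolute gap between validation and training cross-entropy."""
    return abs(ce_val - ce_train)


def beta_decay_loss(alphas):
    """Equation (14): log-sum-exp over the architecture parameters of one edge."""
    m = max(alphas)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(a - m) for a in alphas))


def total_loss(ce, ce_train, ce_val, alphas, lambda_1, lambda_2):
    """Equation (15): L = L_CE + lambda_1 * L_g + lambda_2 * L_beta."""
    return (
        ce
        + lambda_1 * generalization_loss(ce_train, ce_val)
        + lambda_2 * beta_decay_loss(alphas)
    )


# Five candidate operations with equal parameters give L_beta = log(5).
print(total_loss(0.5, 0.4, 0.6, [0.0] * 5, lambda_1=1.0, lambda_2=0.5))
```

Subtracting the maximum before exponentiating leaves the value of Equation (14) unchanged while avoiding overflow for large architecture parameters.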

Algorithm 1 The overall procedure of the proposed EL-NAS.

Input: training dataset $X_{\mathrm{Train}}$ (training samples $X_{\mathrm{train}}$ and labels $Y_{\mathrm{train}}$, validation samples $X_{\mathrm{valid}}$ and labels $Y_{\mathrm{valid}}$), test dataset $X_{\mathrm{Test}}$, batch size n, decision frequency e
Initialization: modular supernet $S$ (architecture parameters $A = \{α^{(i, j)}\}$ and network weights $W = \{ω^{(i, j)}\}$), mixed operation ${\bar{o}}^{(i, j)}$ parameterized by $α^{(i, j)}$ for each edge $e_{i, j}$
Search Stage:
while an undetermined edge exists do
1. Input validation samples $X_{\mathrm{valid}}$ and compute $L_{\mathrm{valid}}$ by $S$;
2. Update the undetermined architecture parameters $A$ by descending $\nabla_{A} L_{\mathrm{valid}} (W, A)$;
3. Input training samples $X_{\mathrm{train}}$ and compute $L_{\mathrm{train}}$ by $S$;
4. Update weights $W$ by descending $\nabla_{W} L_{\mathrm{train}} (W, A)$;
5. Count the number of skip connections among the currently selected operations and compute the regularity factor $δ$;
6. Adjust the architecture parameters of the skip connections by $α_{\mathrm{skip}} = δ α_{\mathrm{skip}}$;
7. If the current epoch satisfies the decision frequency e, select an edge $e_{i^{+}, j^{+}}$ based on the edge decision criterion $S_{e}$;
8. Replace ${\bar{o}}^{(i^{+}, j^{+})}$ with $o^{(i^{+}, j^{+})} = \arg\max_{o \in O} α_{o}^{(i^{+}, j^{+})}$;
9. Delete the unchosen weights $ω_{\mathrm{unchosen}}^{(i^{+}, j^{+})}$ from $W$ and remove $α_{\mathrm{unchosen}}^{(i^{+}, j^{+})}$ from $A$;
end while
Search Output: The final architecture $α^{*}$ derived from the selected operations.
Evaluation Stage:
Input training set $X_{\mathrm{Train}}$ and optimize network weights $ω$ by descending $L_{CE}^{\mathrm{Train}}$;
for sample $x$ in $X_{\mathrm{Test}}$:
Input $x$ into $α^{*}$ and obtain the predicted result;
end for
Evaluation Output: The classification results for all test samples in $X_{\mathrm{Test}}$
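Steps 8 and 9 of Algorithm 1 amount to a simple discretization of one edge, sketched below. This is an illustrative assumption rather than the authors' code: architecture parameters are stored in a hypothetical dictionary that maps each edge to its candidate operations, the operation names and values are invented for the example, and the choice of which edge to decide (criterion $S_{e}$ in step 7) is taken as given.

```python
def decide_edge(alpha, edge):
    """Steps 8-9 for one selected edge: keep the arg-max operation, drop the rest.

    `alpha` maps edge -> {operation name: architecture parameter}.
    The edge is removed from `alpha`, so it no longer takes part in the search.
    """
    ops = alpha.pop(edge)
    return max(ops, key=ops.get)  # arg max over the candidate operations


# Hypothetical architecture parameters for two undetermined edges.
alpha = {
    (0, 2): {"IR3x3": 0.4, "SE": 0.9, "skip": 0.1},
    (1, 2): {"IR3x3": 0.7, "SE": 0.2, "skip": 0.3},
}
chosen = decide_edge(alpha, (0, 2))
print(chosen, list(alpha))  # SE [(1, 2)]
```

After the call, only the edge (1, 2) still carries a mixed operation, so subsequent updates of the architecture parameters touch fewer edges.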

4. Experiments

In this section, we introduce the five HSI data sets and the evaluation metrics used in this paper. The experimental performance, ablation studies, model parameters, and running time are then discussed and analyzed. To verify the effectiveness of EL-NAS in different scenarios, we also evaluate our method in a dataset-independent scenario with the same sensor (IN and SA) and in a sensor-independent scenario with different sensors (IN, UP, HU, SA, and IMDB).

4.1. Hyperspectral Data Sets

Indian Pines (IN) was collected by the AVIRIS sensor over northwestern Indiana, USA, in 1992. This scene has 220 data channels, the spectral range is 0.4 to 2.5 μm, and each band has a spatial size of 145 × 145 pixels. The image has a spatial resolution of 20 m/pixel and contains 16 land-cover categories; about two-thirds of the scene is agriculture and one-third is forest or other natural perennial vegetation. Figure 6 shows the three-band false-color composite of the IN image and the corresponding ground truth data.

The Pavia University (UP) image was recorded by the ROSIS-03 sensor over Pavia in northern Italy and captures the urban area around the University of Pavia. The image size is 610 × 340 × 115, the spatial resolution is 1.3 m/pixel, and the spectral coverage is 0.43 to 0.86 μm. The image contains nine categories. Before the experiment, 12 spectral bands and some samples containing no information were removed. Figure 7 shows the three-band false-color composite of the UP image and the corresponding ground truth data.

The Houston (HU) data set was collected by the Compact Airborne Spectrographic Imager (CASI) in 2013 over the University of Houston campus and adjacent urban areas. HU has 144 spectral channels, the wavelength range is 0.38 to 1.05 μm, and the spatial size is 1905 × 349 pixels with a spatial resolution of 2.5 m/pixel. It has 15 different ground truth classes with 15,029 labeled pixels. Figure 8 shows the three-band false-color composite of the HU image and the corresponding ground truth data.

Salinas (SA) was captured by the 224-band AVIRIS sensor over the Salinas Valley in California and features a high spatial resolution (3.7 m/pixel). The scene covers 512 lines by 217 samples. As with Indian Pines, 20 water absorption bands were discarded, leaving 204 bands. The image contains 16 categories. Figure 9 shows the three-band false-color composite of the SA image and the corresponding ground truth data.

The Chikusei (IMDB) data set was collected by Hyperspectral Visible Near-Infrared Cameras (Hyperspec-VNIR-C) in Chikusei, Ibaraki, Japan, on 19 July 2014. It contains 19 classes and has 2517 × 2335 pixels. Its spatial resolution is 2.5 m per pixel. It consists of 128 spectral bands, which range from 363 to 1018 nm. The IMDB dataset was utilized in the sensor-independent scenario to verify the effects of the proposed EL-NAS. Figure 10 shows the three-band false-color composite of the IMDB image and the corresponding ground truth data.

4.2. Experimental Configuration

We take a pixel-centered patch of size

9 \times 9

as input data. All classification results are reported as the mean and standard deviation over five independent random runs to avoid possible bias caused by random sampling. The number of samples in each category in the training and test sets is shown in Table 1.
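Extracting a pixel-centered 9 × 9 patch can be sketched as follows. The text does not say how pixels near the image border are handled, so the zero padding used here is an assumption, as are the function name `extract_patch` and the nested-list layout `cube[row][col][band]`.

```python
def extract_patch(cube, row, col, size=9):
    """Return a size x size spatial patch centred on (row, col).

    `cube` is indexed as cube[row][col] -> list of band values.
    Positions outside the image are filled with zero vectors (an assumption).
    """
    half = size // 2
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    zero = [0.0] * bands
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(cube[r][c] if inside else zero)
        patch.append(patch_row)
    return patch


# Toy 12 x 12 image with 3 bands; each pixel stores its own coordinates for easy checking.
cube = [[[float(r), float(c), 0.0] for c in range(12)] for r in range(12)]
patch = extract_patch(cube, row=5, col=6)
print(len(patch), len(patch[0]), patch[4][4])  # 9 9 [5.0, 6.0, 0.0]
```

The centre of the patch, at index `[4][4]`, is the labeled pixel itself, and its full spectrum is kept for every spatial position in the patch.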

All experiments in this paper were executed on a computer with an Intel Xeon W-2123 CPU at 3.60 GHz with 32 GB of RAM and an NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU) with 27.8 GB of RAM. The software environment is 64-bit Windows 10 with the PyTorch 1.6.0 DL framework.

4.3. Search Space Configuration

Five types of candidate operations are selected to construct the modular search space (MSS):

Lightweight modules ( $3 \times 3$ and $5 \times 5$ inverted residual modules, IR).
Three-dimensional decomposition convolution (3D convolution with kernel size of $1 \times 1 \times 7$ (SPA) and $3 \times 3 \times 1$ (SPE)).
Attention modules (SE).
Skip connection ( $f (x) = x$ ).
None ( $f (x) = 0$ ).

During the searching phase, a network is constructed using two normal cells. Within each normal cell, the stride for each convolution is set to 1. Throughout the search process, each cell comprises eight nodes, which include five intermediate nodes and a total of 20 edges.
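The node and edge counts above are consistent with a cell in which every intermediate node receives an edge from all earlier nodes. The sketch below checks this arithmetic; the split of the eight nodes into two input nodes, five intermediate nodes, and one output node is an assumption (the text only states the totals), and `count_cell_edges` is a hypothetical helper.

```python
def count_cell_edges(num_inputs=2, num_intermediate=5):
    """Edges in a cell where each intermediate node is fed by all earlier nodes."""
    # Intermediate node i (0-based) sees the input nodes plus i earlier intermediate nodes.
    return sum(num_inputs + i for i in range(num_intermediate))


print(count_cell_edges())  # 2 + 3 + 4 + 5 + 6 = 20
```

If each intermediate node finally keeps two input edges, as described for the edge decision strategy, 10 of these 20 edges survive in the searched cell.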

4.4. Hyperparameter Settings

In the searching phase, we divide the training set into training and validation samples at a ratio of 0.5. Stochastic gradient descent (SGD) is used to optimize the model weights W, with an initial learning rate of 0.005, a momentum of 0.9, and a weight decay of

3 \times 10^{- 4}

. For the architecture parameter A, an Adam optimizer with an initial learning rate of

3 \times 10^{- 4}

, momentum

(0.5, 0.999)

, and weight decay of

10^{- 3}

is used. Edge decisions are made according to the selection criterion, and a complete supernet is not trained during the entire searching phase. After 50 epochs of warm-up, the edge decision is executed every five epochs. In addition, the batch size is increased by 16 after each edge decision, which can further improve the search efficiency.
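The decision schedule described above can be made concrete with a small sketch. Several details are assumptions, not statements from the paper: the first decision is placed at the epoch right after warm-up, the starting batch size of 32 is a placeholder, and ten decisions are assumed (two kept edges for each of five intermediate nodes). The function name `decision_schedule` is hypothetical.

```python
def decision_schedule(num_decisions, warmup=50, frequency=5,
                      start_batch=32, batch_step=16):
    """List (epoch, batch size after the decision) for each edge decision."""
    schedule = []
    batch = start_batch  # placeholder starting value, not given in the text
    for k in range(num_decisions):
        epoch = warmup + k * frequency  # assumes the first decision right after warm-up
        batch += batch_step             # batch size grows by 16 after every decision
        schedule.append((epoch, batch))
    return schedule


for epoch, batch in decision_schedule(10):
    print(epoch, batch)
```

Because every decision removes operations from the supernet, the memory freed can be spent on a larger batch, which is the stated reason for the improved search efficiency.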

In the training phase, we train the model for 1000 epochs with a batch size of 128 and use stochastic gradient descent (SGD) with an initial learning rate of 0.005, a momentum of 0.9, and a weight decay of

3 \times 10^{- 4}

to optimize the model weights W. Other essential hyperparameters include gradient clipping set to 1 and dropout probability set to 0.3.

4.5. Ablation Study

4.5.1. Different Candidate Operations

In this section, we analyze the effects of different candidate operations and verify the effectiveness of the modular search space; the results are shown in Table 2. The comparison between IR and BASE shows that the lightweight module performs better than the basic convolution. The channel attention SE is well suited to datasets with many spectral bands and significantly boosts performance. Adding SPE and SPA further improves performance because of the enhanced ability to extract 3D features of hyperspectral images. The full set of MSS candidate operations achieves the best performance.

4.5.2. Strategy Optimization Scheme

Three distinct architectural designs were explored within each optimization strategy to assess their impact on the search process. The evaluation results for these architectures are presented in Table 3. The regularization term

L_{β}

serves to constrain exceedingly large

α

, thereby allowing for the inclusion of architectural parameters that better represent high-quality architectures. The term

L_{g}

enhances model performance by approximately

0.4 %

, corroborating the notion that a more generalized search model is likely to yield an optimally performing architecture. Figure 11 compares the number of skip connections in models with and without Dynamic Regularization (DR) across ten different searches. DR enables the automatic, dynamic adjustment of weights based on the current number of skip connections during each iteration, thereby reducing the frequency of skip operations and leading to more stable search outcomes.

4.6. Architecture Evaluation

In the first set of experiments, we mainly verify the performance of the proposed EL-NAS under the same scenario (searching and testing on the same dataset). We randomly select

3 %, 1 %, 3 %

of whole labeled samples as the training set,

3 %, 3 %, 3 %

of the whole labeled samples as the validation sets, and the remaining

94 %, 96 %, 94 %

is used as the test set for the three HSI data sets IN, UP, and HU, respectively. We first search for the architecture on the training set and retain the optimal architecture for evaluation. The optimal cell structures obtained from the three data sets are shown in Figure 12. We compare the proposed EL-NAS model with the traditional method SVMCK, six DL methods (2D-CNN, 3D-CNN, DFFN, SSRN, DcCapsGAN, and LMAFN), and one NAS method for HSI classification (Auto-CNN, i.e., 3D-Auto-CNN).

According to the quantitative comparison results shown in Table 4, Table 5 and Table 6, DL-based algorithms achieve better classification results than the traditional method SVMCK on the three data sets. CNN-based methods can be divided into 2D-CNN and 3D-CNN. Overall, 2D-CNN can extract more discriminative SS features through convolution operations and nonlinear activation functions. Compared with 2D-CNN, 3D-CNN achieves better classification accuracy by fully learning spectral features. Both DFFN and SSRN fuse SS features, and SSRN shows better results than DFFN. DcCapsGAN integrates GAN and capsule networks to preserve the relative locations of features and further improve classification performance. LMAFN adopts lightweight structures, which greatly increase the network depth while reducing the size of the model and enhance the nonlinear fitting ability of the model. However, the above traditional algorithms and manually designed DL-based methods are subject to the constraints of subjective human cognition. Auto-CNN achieves satisfactory results by generating neural architectures in an automated way.

The results show that the proposed EL-NAS outperforms all comparison algorithms, including Auto-CNN, on all three datasets. On the University of Pavia (UP) dataset, EL-NAS reaches a classification accuracy of

98.72 %

. This result eclipses the performance metrics of other established algorithms as follows: it is

0.46 %

more accurate than the Spectral–Spatial Residual Network (SSRN) which scores

98.26 %

,

0.72 %

higher than DcCapsGAN with

98.00 %

,

0.45 %

greater than Lightweight Multiscale Attention Fusion Network (LMAFN) at

98.27 %

, and notably

1.59 %

superior to Auto-CNN, which has an accuracy of

97.13 %

. The superior performance of EL-NAS is due to its innovative integration of a lightweight structure, an attention module, and 3D decomposition convolutions. These elements work synergistically to enhance computational efficiency and focus on key features, contributing to its high classification accuracy. Moreover, EL-NAS leverages automated architecture search, avoiding manual design biases and delivering an optimized, resource-efficient model. This results in better performance metrics across all evaluated datasets, highlighting the algorithm’s efficacy and robustness.

In addition, Table 7 compares the number of parameters and the network depths of 2D-CNN, 3D-CNN, DFFN, SSRN, DcCapsGAN, LMAFN, and EL-NAS on the three datasets. From Table 7, on the UP dataset, EL-NAS has only 175657 parameters, which is

54.2 %

less than 443929 parameters of DFFN,

11.4 %

less than 229261 parameters of SSRN, and

99.1 %

less than the 21468326 parameters of DcCapsGAN. While reducing the model’s size, EL-NAS decreases the network depth to 13 layers and presents the most satisfying accuracies on the three datasets. Table 8 presents the running time of DcCapsGAN, 2D-CNN, 3D-CNN, DFFN, SSRN, LMAFN, Auto-CNN, and EL-NAS, including searching time, training time, and test time. For the three data sets, our model takes 68.22 s, 62.66 s, and 71.39 s for searching, 87.81 s, 117.81 s, and 147.43 s for training, and 0.88 s, 3.42 s, and 1.28 s for testing, respectively. Note that we use more efficient but more complex modules than Auto-CNN, so the searched network takes slightly longer to train and test. EL-NAS runs faster than all compared handcrafted deep-learning algorithms, and its search time is also shorter than that of Auto-CNN. These results attest to the efficiency of EL-NAS in both memory utilization and computational overhead, which is largely attributed to the lightweight modules and the search process expedited by edge decisions.

Figure 13, Figure 14 and Figure 15 illustrate the full classification maps obtained from different algorithms on the three HSI data sets. The pixel-based approach SVMCK presents more random noise and more errors, while SS-based approaches such as 2D-CNN, 3D-CNN, DFFN, SSRN, DcCapsGAN, and LMAFN produce smoother results. Among these methods, LMAFN exhibits a smoother classification result and higher accuracy because it simultaneously considers spatial and continuous spectral features. Notably, Auto-CNN obtains precise classification results, which demonstrates the effectiveness of auto-designed neural networks for HSI classification. Nonetheless, compared with the aforementioned algorithms, the proposed EL-NAS not only achieves superior accuracy and classification performance but also does so with a reduced parameter count. This is accomplished by combining lightweight modules with an efficient automated architecture search algorithm.

4.7. Cross Domain Experiment

In the second phase of our experiments, we aim to validate the cross-dataset and cross-sensor capabilities of our proposed EL-NAS framework. Specifically, we conduct tests under two distinct scenarios: a dataset-independent scenario, where the neural network architecture is optimized within the same sensor type but across different datasets, and a sensor-independent scenario, where the architecture is optimized across varying sensor types. To facilitate domain adaptation within the classification network, we have engineered dataset-specific classification layers in the latter stages of the network. Additionally, the convolutional layers preceding the shared cells are designed to adapt to diverse datasets.

4.7.1. Cross-Datasets Architecture Search of EL-NAS

In this section, we utilize the IN and SA datasets collected by the AVIRIS sensor for our experiments. EL-NAS is conducted on the IN dataset, and the optimal cell structure identified is then employed to construct the SA classification network. According to Table 9, using the IN dataset for searching yields classification accuracies of 94.70% and 95.99% on the SA dataset with 10 and 20 labeled samples per class, respectively. Conversely, using the SA dataset for searching results in accuracies of 88.60% and 90.39% on the IN dataset with 10 and 20 labeled samples per class, respectively.

The experimental results further substantiate the efficacy of the proposed EL-NAS method in key evaluation metrics. Notably, the use of a substantial auxiliary dataset (labeled as 10% IN or SA) for architecture searching not only matches but often surpasses the performance achieved using the target datasets. These findings offer an efficient methodology for automatic neural network architecture design across different application scenarios under the same acquisition sensor.

4.7.2. Cross-Sensors Architecture Search of EL-NAS

In this part, we adopt five datasets collected by four kinds of HSI acquisition sensors (i.e., IN and SA from AVIRIS, UP from ROSIS, HU from CASI, IMDB from Hyperspec-VNIR-C). We conduct architecture searching on one of the above datasets, and the classification network derived by the searched architecture is applied to other datasets. The experimental results of the search on HU are shown in Table 10. Our findings indicate that when target data volume is limited, the proposed EL-NAS method, utilizing a large auxiliary dataset (labeled as 10% HU), can achieve comparable or superior performance on key evaluation metrics, compared to using target datasets. These results offer an effective optimization strategy for cross-domain learning applications facing data scarcity, demonstrating that EL-NAS can automatically yield a neural network architecture design with satisfactory results even under different datasets collected by different acquisition sensors.

5. Conclusions

In this article, a novel EL-NAS is designed on the basis of gradient-based NAS to realize an efficient, automatic architecture design for HSI classification. The 3D decomposition convolution, lightweight structure, and attention module are combined to construct an efficient, lightweight attention search space that accelerates the searching procedure and improves the searching results. Furthermore, to mitigate the performance collapse caused by the number of skip connections in the architecture searching procedure, edge decision and dynamic regularization are exploited, based on the entropy of the distribution estimated over non-skip operations and on the number of skip connections, respectively. With the implementation of edge decisions and the reduction in weight sharing, consistency between the searching and evaluation procedures is ensured. In performance evaluation, we also construct a generalization loss to further improve the searching and classification performance. The experiments performed on three different HSI datasets demonstrate that the proposed EL-NAS outperforms other state-of-the-art comparison algorithms in classification accuracy, searching and computational efficiency, number of parameters, and visual comparison performance. In the cross-domain experiments, EL-NAS also shows satisfying performance across different datasets collected by various acquisition sensors. The low searching cost, parameter count, and computational burden of the proposed EL-NAS can pave a new way for its practical application in HSI classification and edge computing.

Author Contributions

Conceptualization, J.W. and J.H.; methodology, J.W. and J.H.; validation, Y.L., Z.H., S.H. and Y.Y.; investigation, J.W., J.H. and Y.L.; writing—original draft preparation, J.W. and J.H.; writing—review and editing, J.W. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant numbers 61801353 and 61977052, in part by GHfund B under grant numbers 202107020822 and 202202022633, in part by the China Postdoctoral Science Foundation funded project under grant number 2018M633474, and in part by the China Aerospace Science and Technology Corporation Joint Laboratory for Innovative Onboard Computer and Electronic Technologies under grant number 2023KFKT001-2.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Lacar, F.M.; Lewis, M.M.; Grierson, I.T. Use of hyperspectral imagery for mapping grape varieties in the Barossa Valley, South Australia. In Proceedings of the Geoscience and Remote Sensing Symposium, Sydney, Australia, 9–13 July 2001.
Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral Remote Sensing Data Analysis and Future Challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36.
Zhang, F.; Wu, L.; Zhu, D.; Liu, Y. Social sensing from street-level imagery: A case study in learning spatio-temporal urban mobility patterns. ISPRS J. Photogramm. Remote Sens. 2019, 153, 48–58.
Zhang, L.; Zhang, L.; Tao, D.; Huang, X.; Du, B. Hyperspectral remote sensing image subpixel target detection based on supervised metric learning. IEEE Trans. Geosci. Remote Sens. 2013, 52, 4955–4965.
Zhong, Y.; Wang, X.; Xu, Y.; Wang, S.; Jia, T.; Hu, X.; Zhao, J.; Wei, L.; Zhang, L. Mini-UAV-Borne Hyperspectral Remote Sensing: From Observation and Processing to Applications. IEEE Geosci. Remote Sens. Mag. 2018, 6, 46–62.
Chen, Y.; Nasrabadi, N.M.; Tran, T.D. Hyperspectral image classification via kernel sparse representation. IEEE Trans. Geosci. Remote Sens. 2012, 51, 217–231.
Yi, C.; Nasrabadi, N.M.; Tran, T.D. Classification for hyperspectral imagery based on sparse representation. In Proceedings of the Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Reykjavik, Iceland, 14–16 June 2010.
Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790.
Peng, J.; Zhou, Y.; Chen, C. Region-Kernel-Based Support Vector Machines for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4810–4824.
Camps-Valls, G.; Bruzzone, L. Kernel-based methods for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2005, 43, 1351–1362.
Wang, J.; Jiao, L.; Liu, H.; Yang, S. Hyperspectral Image Classification by Spatial–Spectral Derivative-Aided Kernel Joint Sparse Representation. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2485–2500.
Wang, J.; Jiao, L.; Shuang, W.; Hou, B.; Fang, L. Adaptive Nonlocal Spatial–Spectral Kernel for Hyperspectral Imagery Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 1–16.
Saxena, L. Recent advances in deep learning. Comput. Rev. 2016, 57, 563–564.
Zhang, H.; Li, Y.; Zhang, Y.; Shen, Q. Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network. Remote Sens. Lett. 2017, 8, 438–447.
Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral-Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858.
Slavkovikj, V.; Verstockt, S.; Neve, W.D.; Hoecke, S.V.; Walle, R. Hyperspectral Image Classification with Convolutional Neural Networks. In Proceedings of the 23rd ACM International Conference, Montreal, QC, Canada, 18–22 October 2021.
He, M.; Bo, L.; Chen, H. Multi-scale 3D deep convolutional neural network for hyperspectral image classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017.
Mou, L.; Ghamisi, P.; Zhu, X.X. Unsupervised Spectral-Spatial Feature Learning via Deep Residual Conv-Deconv Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 391–406.
Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral Image Classification With Deep Feature Fusion Network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3173–3184.
Xi, B.; Li, J.; Diao, Y.; Li, Y.; Li, Z.; Huang, Y.; Chanussot, J. DGSSC: A Deep Generative Spectral-Spatial Classifier for Imbalanced Hyperspectral Imagery. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1535–1548.
Zhang, H.; Li, Y.; Jiang, Y.; Wang, P.; Shen, C. Hyperspectral Classification Based on Lightweight 3-D-CNN with Transfer Learning. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5813–5828.
Wang, J.; Guo, S.; Huang, R.; Li, L.; Jiao, L. Dual-Channel Capsule Generation Adversarial Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16.
Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. QTN: Quaternion Transformer Network for Hyperspectral Image Classification. IEEE Trans. Circuits Syst. Video Technol. 2023.
Wang, J.; Huang, R.; Guo, S.; Li, L.; Zhu, M.; Yang, S.; Jiao, L. NAS-Guided Lightweight Multiscale Attention Fusion Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8754–8767.
Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollar, P. Designing Network Design Spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2020.
Liu, B.; Yu, X.; Yu, A.; Wan, G. Deep convolutional recurrent neural network with transfer learning for hyperspectral image classification. J. Appl. Remote Sens. 2018, 12, 026028.
He, X.; Chen, Y.; Ghamisi, P. Heterogeneous transfer learning for hyperspectral image classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3246–3263.
Liu, X.; Hu, Q.; Cai, Y.; Cai, Z. Extreme learning machine-based ensemble transfer learning for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3892–3902.
Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. arXiv 2014, arXiv:1405.3866.
Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39.
Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. arXiv 2021, arXiv:2103.15808.
Leiva-Aravena, E.; Leiva, E.; Zamorano, V.; Rojas, C.; John, M. Neural Architecture Search with Reinforcement Learning. arXiv 2019, arXiv:1611.01578.
Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4095–4104.
Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing neural network architectures using reinforcement learning. arXiv 2016, arXiv:1611.02167.
Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the International Conference on Machine Learning, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 2902–2911.
Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; Kavukcuoglu, K. Hierarchical representations for efficient architecture search. arXiv 2017, arXiv:1711.00436.
Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828.
Liu, H.; Simonyan, K.; Yang, Y. Darts: Differentiable architecture search. arXiv 2018, arXiv:1806.09055.
Li, C.; Ning, J.; Hu, H.; He, K. Enhancing the Robustness, Efficiency, and Diversity of Differentiable Architecture Search. arXiv 2022, arXiv:2204.04681.
Xia, X.; Xiao, X.; Wang, X.; Zheng, M. Progressive Automatic Design of Search Space for One-Shot Neural Architecture Search. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2455–2464.
Liu, Y.; Li, T.; Zhang, P.; Yan, Y. Improved conformer-based end-to-end speech recognition using neural architecture search. arXiv 2021, arXiv:2104.05390.
Li, H.; Wu, G.; Zheng, W.S. Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6729–6738.
Zhang, H.; Gong, C.; Bai, Y.; Bai, Z.; Li, Y. 3-D-ANAS: 3-D Asymmetric Neural Architecture Search for Fast Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–19.
Xue, X.; Zhang, H.; Fang, B.; Bai, Z.; Li, Y. Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification. arXiv 2021, arXiv:2110.11084.
Liang, H.; Zhang, S.; Sun, J.; He, X.; Huang, W.; Zhuang, K.; Li, Z. Darts+: Improved differentiable architecture search with early stopping. arXiv 2019, arXiv:1909.06035.
Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.J.; Tian, Q.; Xiong, H. PC-DARTS: Partial channel connections for memory-efficient architecture search. arXiv 2019, arXiv:1907.05737.
Chu, X.; Wang, X.; Zhang, B.; Lu, S.; Wei, X.; Yan, J. DARTS-: Robustly stepping out of performance collapse without indicators. arXiv 2020, arXiv:2009.01027.
Li, G.; Qian, G.; Delgadillo, I.C.; Muller, M.; Thabet, A.; Ghanem, B. Sgas: Sequential greedy architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1620–1630.
Chu, X.; Zhang, B.; Xu, R. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12239–12248.
Hou, P.; Jin, Y.; Chen, Y. Single-DARTS: Towards Stable Architecture Search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 373–382.
Zela, A.; Elsken, T.; Saikia, T.; Marrakchi, Y.; Brox, T.; Hutter, F. Understanding and Robustifying Differentiable Architecture Search. arXiv 2019, arXiv:1909.09656. [Google Scholar]
Ye, P.; Li, B.; Li, Y.; Chen, T.; Fan, J.; Ouyang, W. beta-DARTS: Beta-Decay Regularization for Differentiable Architecture Search. arXiv 2022, arXiv:2203.01665. [Google Scholar]
Huang, L.; Sun, S.; Zeng, J.; Wang, W.; Pang, W.; Wang, K. U-DARTS: Uniform-space differentiable architecture search. Inf. Sci. 2023, 628, 339–349. [Google Scholar] [CrossRef]
Wang, W.; Zhang, X.; Cui, H.; Yin, H.; Zhang, Y. FP-DARTS: Fast parallel differentiable neural architecture search for image classification. Pattern Recognit. 2023, 136, 109193. [Google Scholar] [CrossRef]
Zhang, C.; Liu, X.; Wang, G.; Cai, Z. Particle Swarm Optimization Based Deep Learning Architecture Search for Hyperspectral Image Classification. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 509–512. [Google Scholar]
Liu, X.; Zhang, C.; Cai, Z.; Yang, J.; Zhou, Z.; Gong, X. Continuous Particle Swarm Optimization-Based Deep Learning Architecture Search for Hyperspectral Image Classification. Remote Sens. 2021, 13, 1082. [Google Scholar] [CrossRef]
Chen, Y.; Zhu, K.; Zhu, L.; He, X.; Ghamisi, P.; Benediktsson, J.A. Automatic design of convolutional neural network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7048–7066. [Google Scholar] [CrossRef]
Zhan, L.; Fan, J.; Ye, P.; Cao, J. A2S-NAS: Asymmetric Spectral-Spatial Neural Architecture Search for Hyperspectral Image Classification. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5. [Google Scholar]
Cao, C.; Xiang, H.; Song, W.; Yi, H.; Xiao, F.; Gao, X. Lightweight Multiscale Neural Architecture Search With Spectral–Spatial Attention for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]

Figure 1. The search framework of the proposed EL-NAS for HSI classification.

Figure 2. The whole modular search space and searching network of the proposed EL-NAS.

Figure 3. The principle of 3D convolution decomposition.

Figure 4. The performance impact of the number of skip connections on Pavia.

Figure 5. The search workflow with edge decision and dynamic regularization.

Figure 6. IN. (a) False-color image. (b) Ground-truth map.

Figure 7. UP. (a) False-color image. (b) Ground-truth map.

Figure 8. HU. (a) False-color image. (b) Ground-truth map.

Figure 9. SA. (a) False-color image. (b) Ground-truth map.

Figure 10. IMDB. (a) False-color image. (b) Ground-truth map.

Figure 11. The number of skip connections in the operations selected in ten independent searches.

Figure 12. Best cell architecture on different dataset settings. (a) IN; (b) UP; (c) HU.

Figure 13. Classification maps for IN. (a) False-color image; (b) ground-truth map; (c) SVMCK; (d) 2D-CNN; (e) 3D-CNN; (f) DFFN; (g) SSRN; (h) DcCapsGAN; (i) AUTO-CNN; (j) LMAFN; (k) EL-NAS.

Figure 14. Classification maps for UP. (a) False-color image; (b) ground-truth map; (c) SVMCK; (d) 2D-CNN; (e) 3D-CNN; (f) DFFN; (g) SSRN; (h) DcCapsGAN; (i) AUTO-CNN; (j) LMAFN; (k) EL-NAS.

Figure 15. Classification maps for HU. (a) False-color image; (b) ground-truth map; (c) SVM; (d) SVMCK; (e) 2D-CNN; (f) 3D-CNN; (g) DFFN; (h) SSRN; (i) DcCapsGAN; (j) AUTO-CNN; (k) EL-NAS.

Table 1. Sample setup of IN, UP and HU data sets.

IN				UP						HU
Class	Class Name	Train	Test	#	Class	Class Name	Train	Test	#	Class	Class Name	Train	Test
1	Alfalfa	2	51	#	1	Asphalt	67	6963	#	1	Healthy grass	38	1314
2	Corn-notill	43	1571	#	2	Meadows	187	19,582	#	2	Stressed grass	38	1317
3	Corn-mintill	25	913	#	3	Gravel	21	2204	#	3	Synthetic grass	21	732
4	Corn	8	261	#	4	Trees	31	3218	#	4	Trees	38	1307
5	Grass-pasture	15	532	#	5	Sheets	14	1413	#	5	Soil	38	1305
6	Grass-trees	22	803	#	6	Baresoil	51	5281	#	6	Water	10	342
7	Grass-pasture-mowed	1	31	#	7	Bitumen	14	1397	#	7	Residential	39	1332
8	Hay-windrowed	15	526	#	8	Bricks	37	3867	#	8	Commercial	38	1307
9	Oats	1	22	#	9	Shadows	10	995	#	9	Road	38	1315
10	Soybean-notill	30	1070	#					#	10	Highway	37	1289
11	Soybean-mintill	74	2701	#					#	11	Railway	38	1297
12	Soybean-clean	18	653	#					#	12	Parking Lot 1	37	1295
13	Wheat	7	226	#					#	13	Parking Lot 2	15	493
14	Woods	38	1392	#					#	14	Tennis Court	13	450
15	Buildings-Grass-Trees-Drives	12	425	#					#	15	Running Track	20	693
16	Stone-Steel-Towers	3	103	#					#
Total		314	9935	#	Total		432	42,344	#	Total		458	15,788

Table 2. The performance of different candidate operations.

Candidate Operations	OA	AA	KAPPA
BASE(dilconv+sepconv)	$98.00 \pm 0.02$	$97.59 \pm 0.07$	$97.34 \pm 0.03$
IR	$98.10 \pm 0.04$	$97.57 \pm 0.14$	$97.47 \pm 0.05$
IR+BASE	$98.12 \pm 0.16$	$97.73 \pm 0.08$	$97.50 \pm 0.22$
IR+SE	$98.21 \pm 0.12$	$97.82 \pm 0.11$	$97.61 \pm 0.17$
IR+pointconv	$98.07 \pm 0.13$	$97.72 \pm 0.11$	$97.43 \pm 0.18$
IR+SPA	$98.14 \pm 0.04$	$97.83 \pm 0.07$	$97.52 \pm 0.05$
IR+SPE	$98.14 \pm 0.07$	$97.82 \pm 0.03$	$97.52 \pm 0.10$
IR+SPA+SPE	$98.16 \pm 0.04$	$97.76 \pm 0.16$	$97.55 \pm 0.05$
MSS	$98.27 \pm 0.09$	$97.82 \pm 0.13$	$97.69 \pm 0.12$
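
The OA, AA and KAPPA columns in Table 2 and in the later result tables are not defined in this part of the article. The sketch below assumes the standard definitions (overall accuracy, mean per-class accuracy, and Cohen's kappa) computed from a confusion matrix. The function name `classification_metrics` and the toy matrix are hypothetical and are not taken from the paper.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) in percent from a square confusion matrix.

    confusion[i][j] counts test pixels of true class i predicted as class j.
    Standard definitions are assumed here; the paper's exact evaluation
    code is not shown in the source.
    """
    n_classes = len(confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    total = sum(row_sums)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all test pixels classified correctly.
    oa = correct / total

    # Average accuracy: mean of per-class accuracies (empty classes skipped).
    per_class = [confusion[i][i] / row_sums[i]
                 for i in range(n_classes) if row_sums[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = sum(row_sums[k] * col_sums[k] for k in range(n_classes)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e)

    return 100 * oa, 100 * aa, 100 * kappa


# Hypothetical 3-class confusion matrix, for illustration only.
toy = [[50, 2, 0],
       [3, 40, 5],
       [0, 1, 19]]
oa, aa, kappa = classification_metrics(toy)
print(f"OA={oa:.2f}  AA={aa:.2f}  KAPPA={kappa:.2f}")
```

AA weights every class equally, so it is more sensitive than OA to small classes such as those with only one or two training samples in Table 1.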

Table 3. The impact of different strategy optimization schemes in three individual searches.

	$L_{CE}$				$L_{CE} + L_{β}$				$L_{CE} + L_{β} + L_{g}$				$L_{CE} + L_{β} + L_{g} + DR (δ)$
Exp	1	2	3	Mean	1	2	3	Mean	1	2	3	Mean	1	2	3	Mean
OA(%)	98.30	98.36	98.27	98.31	98.46	98.47	98.41	98.45	98.79	98.85	98.80	98.81	98.80	98.86	98.81	98.82
AA(%)	97.49	97.68	97.76	97.64	98.19	97.97	97.90	98.02	98.24	98.44	98.40	98.36	98.29	98.44	98.41	98.38
KAPPA(%)	97.73	97.80	97.81	97.78	98.24	98.02	98.41	98.22	98.39	98.46	98.40	98.42	98.39	98.47	98.41	98.42
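
The Mean columns of Table 3 can be reproduced from the three per-search values. The sketch below copies the numbers from the table and recomputes each mean to two decimals. The dictionary layout and the helper `mean2` are illustrative choices, not part of the paper.

```python
# Per-search values copied from Table 3 (searches 1-3 for each scheme).
table3 = {
    "L_CE": {
        "OA": [98.30, 98.36, 98.27],
        "AA": [97.49, 97.68, 97.76],
        "KAPPA": [97.73, 97.80, 97.81],
    },
    "L_CE + L_beta": {
        "OA": [98.46, 98.47, 98.41],
        "AA": [98.19, 97.97, 97.90],
        "KAPPA": [98.24, 98.02, 98.41],
    },
    "L_CE + L_beta + L_g": {
        "OA": [98.79, 98.85, 98.80],
        "AA": [98.24, 98.44, 98.40],
        "KAPPA": [98.39, 98.46, 98.40],
    },
    "L_CE + L_beta + L_g + DR": {
        "OA": [98.80, 98.86, 98.81],
        "AA": [98.29, 98.44, 98.41],
        "KAPPA": [98.39, 98.47, 98.41],
    },
}


def mean2(values):
    """Arithmetic mean rounded to two decimals, as reported in the table."""
    return round(sum(values) / len(values), 2)


means = {scheme: {metric: mean2(vals) for metric, vals in metrics.items()}
         for scheme, metrics in table3.items()}

for scheme, metrics in means.items():
    print(scheme, metrics)

# Gain in mean OA from the plain cross-entropy loss to the full scheme.
oa_gain = round(means["L_CE + L_beta + L_g + DR"]["OA"] - means["L_CE"]["OA"], 2)
print("OA gain:", oa_gain)
```

With the reported values, the recomputed means match the Mean columns of the table, and most of the OA improvement appears once the $L_{β}$ and $L_{g}$ terms are added.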

Table 4. Classification results of different methods for labeled pixels of the IN data set.

Class	SVMCK	2D-CNN	3D-CNN	DFFN	SSRN	DcCapsGAN	Auto-CNN	LMAFN	EL-NAS
1	$79.23 \pm 6.82$	$78.79 \pm 1.31$	$30.37 \pm 2.57$	$90.00 \pm 8.70$	$76.26 \pm 20.00$	$11.19 \pm 0.15$	$68.94 \pm 2.14$	$83.18 \pm 15.85$	$68.18 \pm 6.69$
2	$83.60 \pm 2.97$	$72.68 \pm 1.45$	$89.58 \pm 1.15$	$88.35 \pm 3.54$	$91.98 \pm 5.03$	$90.49 \pm 0.25$	$88.95 \pm 2.40$	$91.95 \pm 3.40$	$90.57 \pm 0.36$
3	$83.94 \pm 5.53$	$77.31 \pm 0.72$	$64.68 \pm 1.71$	$87.17 \pm 5.36$	$93.07 \pm 4.21$	$96.48 \pm 0.19$	$81.99 \pm 0.83$	$90.52 \pm 3.77$	$95.57 \pm 0.76$
4	$79.65 \pm 5.65$	$77.44 \pm 2.41$	$53.19 \pm 1.09$	$92.40 \pm 4.30$	$77.47 \pm 13.02$	$72.13 \pm 0.83$	$79.91 \pm 2.34$	$93.01 \pm 5.07$	$87.05 \pm 3.03$
5	$92.70 \pm 2.95$	$81.41 \pm 1.07$	$75.34 \pm 0.12$	$87.99 \pm 6.43$	$99.75 \pm 0.35$	$94.67 \pm 0.01$	$93.38 \pm 0.63$	$90.60 \pm 4.37$	$92.09 \pm 1.51$
6	$92.73 \pm 4.14$	$85.12 \pm 1.34$	$98.02 \pm 0.24$	$91.45 \pm 6.79$	$98.39 \pm 1.47$	$99.15 \pm 0.14$	$99.44 \pm 0.12$	$98.32 \pm 1.58$	$96.56 \pm 1.16$
7	$95.20 \pm 1.60$	$4.94 \pm 2.14$	$44.44 \pm 0.29$	$77.41 \pm 23.68$	$86.32 \pm 9.86$	$32.10 \pm 2.14$	$60.49 \pm 14.29$	$90.00 \pm 16.81$	$96.30 \pm 3.02$
8	$97.59 \pm 1.66$	$99.78 \pm 0.08$	$100.00 \pm 0.00$	$99.00 \pm 1.54$	$97.41 \pm 3.66$	$99.93 \pm 0.12$	$100.00 \pm 0.00$	$99.09 \pm 0.96$	$100.00 \pm 0.00$
9	$54.74 \pm 21.98$	$68.42 \pm 13.93$	$22.81 \pm 6.08$	$81.58 \pm 20.14$	$93.94 \pm 8.57$	$47.37 \pm 5.26$	$78.95 \pm 15.49$	$67.37 \pm 24.21$	$89.47 \pm 11.37$
10	$84.80 \pm 4.55$	$79.26 \pm 4.29$	$71.51 \pm 0.58$	$88.41 \pm 5.20$	$89.31 \pm 11.89$	$91.16 \pm 0.13$	$83.72 \pm 1.97$	$92.31 \pm 4.36$	$93.74 \pm 0.43$
11	$89.39 \pm 2.14$	$91.01 \pm 1.24$	$90.54 \pm 0.42$	$94.72 \pm 1.06$	$91.27 \pm 5.54$	$96.82 \pm 0.06$	$90.02 \pm 1.61$	$95.30 \pm 1.55$	$96.72 \pm 0.72$
12	$75.36 \pm 8.85$	$88.99 \pm 2.73$	$89.68 \pm 1.23$	$90.52 \pm 3.19$	$71.68 \pm 11.96$	$78.03 \pm 0.10$	$80.58 \pm 1.97$	$93.11 \pm 3.93$	$96.93 \pm 0.57$
13	$99.02 \pm 0.62$	$96.30 \pm 0.58$	$99.50 \pm 0.47$	$90.51 \pm 8.64$	$96.67 \pm 2.89$	$99.66 \pm 0.58$	$99.33 \pm 0.24$	$98.84 \pm 0.96$	$99.66 \pm 0.24$
14	$95.86 \pm 1.62$	$89.95 \pm 0.53$	$95.65 \pm 0.05$	$96.81 \pm 1.88$	$94.93 \pm 3.12$	$99.13 \pm 0.12$	$98.91 \pm 0.20$	$97.74 \pm 2.16$	$99.59 \pm 0.20$
15	$82.77 \pm 6.47$	$88.32 \pm 3.21$	$75.40 \pm 0.80$	$90.27 \pm 6.77$	$86.12 \pm 8.92$	$83.87 \pm 0.31$	$81.64 \pm 1.97$	$94.39 \pm 6.87$	$85.29 \pm 0.87$
16	$95.22 \pm 4.79$	$75.56 \pm 4.84$	$72.22 \pm 0.52$	$58.89 \pm 8.37$	$93.33 \pm 5.18$	$98.52 \pm 0.64$	$88.15 \pm 14.41$	$95.22 \pm 6.07$	$88.89 \pm 1.01$
OA(%)	$88.17 \pm 1.30$	$84.74 \pm 0.30$	$85.40 \pm 0.17$	$91.57 \pm 1.35$	$89.98 \pm 0.50$	$93.14 \pm 0.02$	$89.90 \pm 0.80$	$94.37 \pm 0.71$	$94.96 \pm 0.25$
AA(%)	$86.36 \pm 2.17$	$78.46 \pm 1.31$	$73.31 \pm 0.48$	$87.84 \pm 3.33$	$89.87 \pm 1.18$	$80.67 \pm 0.35$	$89.55 \pm 1.40$	$91.94 \pm 3.15$	$93.39 \pm 0.68$
KAPPA(%)	$86.52 \pm 1.48$	$82.55 \pm 0.38$	$83.25 \pm 0.20$	$90.38 \pm 1.56$	$88.58 \pm 0.51$	$92.16 \pm 0.02$	$88.37 \pm 0.91$	$93.58 \pm 0.82$	$94.20 \pm 0.29$
PARAM	-	186,096	9068	374,880	376,892	33,521,328	176,299	148,651	274,613

Table 5. Classification results of different methods for labeled pixels of the UP data set.

Class	SVMCK	2D-CNN	3D-CNN	DFFN	SSRN	DcCapsGAN	Auto-CNN	LMAFN	EL-NAS
1	$92.82 \pm 2.17$	$91.04 \pm 2.77$	$96.81 \pm 2.18$	$98.21 \pm 1.29$	$99.04 \pm 0.22$	$99.22 \pm 0.07$	$93.88 \pm 0.84$	$99.14 \pm 0.72$	$99.67 \pm 0.06$
2	$98.94 \pm 0.49$	$97.49 \pm 1.62$	$90.00 \pm 15.18$	$99.47 \pm 0.45$	$99.54 \pm 0.24$	$99.93 \pm 0.03$	$99.96 \pm 0.03$	$99.42 \pm 0.40$	$99.91 \pm 0.04$
3	$86.40 \pm 1.84$	$70.97 \pm 8.52$	$86.15 \pm 4.14$	$92.37 \pm 5.47$	$98.76 \pm 1.31$	$82.23 \pm 0.07$	$86.11 \pm 0.61$	$91.50 \pm 4.23$	$91.63 \pm 1.38$
4	$93.60 \pm 1.24$	$92.32 \pm 3.46$	$93.40 \pm 7.07$	$87.71 \pm 2.56$	$99.95 \pm 0.04$	$97.92 \pm 0.03$	$93.67 \pm 0.69$	$96.11 \pm 1.35$	$93.86 \pm 0.76$
5	$99.28 \pm 0.34$	$99.22 \pm 0.44$	$96.99 \pm 4.96$	$95.18 \pm 4.99$	$99.95 \pm 0.04$	$99.90 \pm 0.11$	$99.95 \pm 0.04$	$99.80 \pm 0.32$	$99.55 \pm 0.06$
6	$93.98 \pm 1.65$	$79.90 \pm 7.80$	$89.64 \pm 1.50$	$99.40 \pm 0.93$	$98.79 \pm 1.47$	$95.81 \pm 0.03$	$98.33 \pm 0.22$	$98.80 \pm 1.13$	$99.72 \pm 0.15$
7	$90.62 \pm 1.97$	$71.70 \pm 6.69$	$88.18 \pm 5.28$	$96.47 \pm 3.16$	$99.79 \pm 0.16$	$94.74 \pm 0.12$	$89.08 \pm 1.71$	$96.89 \pm 2.58$	$99.70 \pm 0.00$
8	$93.08 \pm 1.56$	$92.59 \pm 4.63$	$87.03 \pm 3.41$	$96.83 \pm 2.54$	$88.54 \pm 2.78$	$97.73 \pm 0.11$	$97.42 \pm 0.72$	$95.25 \pm 1.96$	$97.60 \pm 0.36$
9	$87.92 \pm 5.22$	$92.93 \pm 2.35$	$99.75 \pm 0.16$	$72.57 \pm 5.23$	$97.22 \pm 1.49$	$99.61 \pm 0.12$	$99.40 \pm 0.13$	$99.18 \pm 0.81$	$96.69 \pm 0.31$
OA(%)	$95.41 \pm 0.50$	$91.48 \pm 2.10$	$94.21 \pm 0.44$	$97.03 \pm 0.70$	$98.26 \pm 0.07$	$97.97 \pm 0.03$	$97.13 \pm 0.05$	$98.27 \pm 0.49$	$98.72 \pm 0.12$
AA(%)	$92.96 \pm 0.68$	$87.57 \pm 1.97$	$91.99 \pm 0.82$	$93.13 \pm 1.21$	$97.95 \pm 0.17$	$96.34 \pm 0.03$	$96.15 \pm 0.14$	$97.34 \pm 0.75$	$98.26 \pm 0.09$
KAPPA(%)	$93.91 \pm 0.66$	$88.62 \pm 2.81$	$92.27 \pm 0.59$	$96.06 \pm 0.92$	$97.69 \pm 0.09$	$97.30 \pm 0.04$	$96.16 \pm 0.07$	$97.73 \pm 0.46$	$98.30 \pm 0.15$
PARAM	-	185,193	5253	443,929	229,261	21,468,326	156,101	140,260	175,657

Table 6. Classification results of different methods for labeled pixels of the HU data set.

Class	SVMCK	2D-CNN	3D-CNN	DFFN	SSRN	DcCapsGAN	Auto-CNN	LMAFN	EL-NAS
1	$97.16 \pm 1.74$	$90.74 \pm 2.19$	$96.37 \pm 0.59$	$93.62 \pm 2.32$	$97.97 \pm 2.37$	$98.68 \pm 0.22$	$98.10 \pm 0.18$	$97.54 \pm 2.47$	$97.80 \pm 0.31$
2	$95.99 \pm 3.32$	$87.21 \pm 3.01$	$96.38 \pm 1.08$	$92.80 \pm 3.11$	$95.34 \pm 4.60$	$98.08 \pm 0.17$	$98.88 \pm 0.10$	$98.60 \pm 1.30$	$99.29 \pm 0.04$
3	$99.62 \pm 0.48$	$96.38 \pm 0.60$	$98.67 \pm 1.31$	$97.91 \pm 2.32$	$100.00 \pm 0.00$	$97.58 \pm 0.23$	$97.58 \pm 1.06$	$99.78 \pm 0.39$	$99.70 \pm 0.21$
4	$92.02 \pm 4.06$	$93.93 \pm 0.84$	$96.85 \pm 3.74$	$85.61 \pm 3.77$	$99.66 \pm 0.32$	$94.09 \pm 0.13$	$99.83 \pm 0.12$	$98.26 \pm 1.83$	$99.81 \pm 0.17$
5	$98.32 \pm 1.43$	$97.88 \pm 0.69$	$99.72 \pm 0.19$	$99.57 \pm 0.98$	$95.46 \pm 2.30$	$99.81 \pm 0.17$	$100.00 \pm 0.00$	$99.65 \pm 0.62$	$100.00 \pm 0.00$
6	$90.35 \pm 4.07$	$70.39 \pm 1.86$	$97.04 \pm 2.88$	$88.70 \pm 5.08$	$100.00 \pm 0.00$	$85.71 \pm 0.32$	$90.48 \pm 4.33$	$93.08 \pm 5.00$	$94.71 \pm 2.49$
7	$91.29 \pm 4.07$	$95.09 \pm 1.82$	$86.37 \pm 3.48$	$92.70 \pm 5.25$	$94.68 \pm 2.33$	$95.77 \pm 0.21$	$93.38 \pm 1.03$	$94.85 \pm 2.77$	$95.66 \pm 0.50$
8	$86.98 \pm 2.43$	$76.07 \pm 4.30$	$73.76 \pm 3.27$	$87.31 \pm 5.20$	$97.48 \pm 1.42$	$76.88 \pm 0.38$	$85.19 \pm 0.57$	$90.37 \pm 2.43$	$85.21 \pm 0.55$
9	$88.90 \pm 6.12$	$86.43 \pm 2.82$	$79.63 \pm 0.96$	$87.95 \pm 3.65$	$94.76 \pm 0.91$	$90.83 \pm 0.34$	$81.82 \pm 1.48$	$89.77 \pm 2.82$	$84.71 \pm 0.17$
10	$89.53 \pm 2.77$	$85.53 \pm 4.76$	$84.52 \pm 7.45$	$97.38 \pm 1.67$	$90.60 \pm 3.71$	$97.81 \pm 0.17$	$99.19 \pm 0.17$	$97.12 \pm 1.60$	$100.00 \pm 0.00$
11	$86.70 \pm 2.01$	$82.85 \pm 1.87$	$88.62 \pm 4.95$	$96.54 \pm 3.13$	$85.88 \pm 6.22$	$89.51 \pm 0.04$	$95.66 \pm 0.59$	$96.56 \pm 2.31$	$97.77 \pm 0.49$
12	$88.95 \pm 2.18$	$86.18 \pm 3.59$	$86.32 \pm 2.28$	$92.58 \pm 2.63$	$90.77 \pm 7.53$	$97.07 \pm 0.44$	$96.15 \pm 0.43$	$97.43 \pm 1.62$	$97.83 \pm 0.58$
13	$76.56 \pm 2.37$	$89.96 \pm 0.67$	$76.41 \pm 13.92$	$93.37 \pm 5.23$	$94.50 \pm 0.78$	$79.47 \pm 0.32$	$96.77 \pm 1.71$	$92.56 \pm 4.09$	$96.92 \pm 1.40$
14	$97.93 \pm 1.89$	$92.31 \pm 0.57$	$95.74 \pm 5.75$	$99.93 \pm 2.40$	$100.00 \pm 0.00$	$99.92 \pm 0.14$	$99.68 \pm 0.11$	$98.10 \pm 2.45$	$99.68 \pm 0.30$
15	$99.88 \pm 0.17$	$83.94 \pm 3.06$	$98.91 \pm 1.09$	$97.39 \pm 2.93$	$98.33 \pm 1.37$	$99.84 \pm 0.16$	$99.90 \pm 0.15$	$99.92 \pm 0.23$	$100.00 \pm 0.00$
OA(%)	$92.02 \pm 0.81$	$88.19 \pm 0.27$	$89.76 \pm 0.65$	$93.20 \pm 0.58$	$94.59 \pm 0.32$	$93.84 \pm 0.14$	$95.26 \pm 0.18$	$96.24 \pm 0.66$	$96.28 \pm 0.12$
AA(%)	$92.01 \pm 0.82$	$87.66 \pm 0.37$	$90.37 \pm 1.09$	$93.56 \pm 0.60$	$95.70 \pm 0.17$	$93.40 \pm 0.13$	$95.67 \pm 0.08$	$96.24 \pm 0.73$	$96.65 \pm 0.29$
KAPPA(%)	$91.37 \pm 0.87$	$87.22 \pm 0.30$	$88.93 \pm 0.71$	$92.65 \pm 0.62$	$94.16 \pm 0.35$	$93.33 \pm 0.15$	$94.84 \pm 0.20$	$95.94 \pm 0.72$	$95.95 \pm 0.13$
PARAM	-	185,967	8523	375,103	290,851	27,055,608	172,373	143,658	238,292

Table 7. Number of parameters and depth of different models for the three data sets.

Dataset	IN		UP		HU
Model	Parameters	Depth	Parameters	Depth	Parameters	Depth
2D-CNN	186,096	3	185,193	3	185,967	3
3D-CNN	9068	3	5253	3	8523	3
DFFN	374,880	27	443,929	33	375,103	27
SSRN	376,892	13	229,261	13	290,851	13
DcCapsGAN	33,521,328	/	21,468,326	/	27,055,608	/
LMAFN	148,651	57	140,260	57	143,658	57
EL-NAS	274,613	13	175,657	13	238,292	13
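
To make the size comparison in Table 7 easier to read, the sketch below expresses each model's parameter count as a multiple of the EL-NAS count on the same data set. The counts are copied from the table. The dictionary `params` and the helper `ratio_to_elnas` are hypothetical names used only for this illustration.

```python
# Parameter counts copied from Table 7, ordered as (IN, UP, HU).
DATASETS = ("IN", "UP", "HU")
params = {
    "2D-CNN": (186_096, 185_193, 185_967),
    "3D-CNN": (9_068, 5_253, 8_523),
    "DFFN": (374_880, 443_929, 375_103),
    "SSRN": (376_892, 229_261, 290_851),
    "DcCapsGAN": (33_521_328, 21_468_326, 27_055_608),
    "LMAFN": (148_651, 140_260, 143_658),
    "EL-NAS": (274_613, 175_657, 238_292),
}


def ratio_to_elnas(model, dataset):
    """Parameter count of `model` divided by that of EL-NAS on `dataset`."""
    idx = DATASETS.index(dataset)
    return params[model][idx] / params["EL-NAS"][idx]


for model in params:
    ratios = ", ".join(f"{d}: {ratio_to_elnas(model, d):.2f}x" for d in DATASETS)
    print(f"{model:10s} {ratios}")
```

A ratio above 1 means the model is larger than the searched EL-NAS network. By these counts, DFFN, SSRN and DcCapsGAN are larger on all three data sets, while 3D-CNN and LMAFN are smaller.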

Table 8. Running times (s) for three datasets: 'Training' refers to the total duration required for model training. 'Test' denotes the complete time taken for testing. 'Searching' indicates the time needed to complete a search.

	IN			UP			HU
	Searching	Training	Test	Searching	Training	Test	Searching	Training	Test
DcCapsGAN	-	148.07	22.83	-	68.49	41.08	-	125.59	23.38
2D-CNN	-	18.43	4.96	-	10.53	11.21	-	16.55	5.32
3D-CNN	-	50.54	3.51	-	23.53	4.56	-	43.29	3.03
DFFN	-	337.72	1.10	-	376.53	3.98	-	350.40	1.54
SSRN	-	227.34	10.47	-	290.77	27.38	-	350.66	11.55
LMAFN	-	171.23	0.82	-	156.43	1.73	-	161.13	0.83
Auto-CNN	82.43	86.22	0.82	73.56	108.98	2.88	89.41	132.00	1.17
EL-NAS	68.22	87.81	0.88	62.66	117.81	3.42	71.39	147.43	1.28
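
For the two NAS-based methods in Table 8, the end-to-end cost of obtaining a trained model is the searching time plus the training time. The sketch below adds the two columns for Auto-CNN and EL-NAS, using the values from the table. Treating the simple sum as the total cost is an assumption made for illustration, and the names `times` and `search_plus_train` are hypothetical.

```python
# (searching, training) times in seconds, copied from Table 8.
times = {
    "Auto-CNN": {"IN": (82.43, 86.22), "UP": (73.56, 108.98), "HU": (89.41, 132.00)},
    "EL-NAS": {"IN": (68.22, 87.81), "UP": (62.66, 117.81), "HU": (71.39, 147.43)},
}


def search_plus_train(method, dataset):
    """Sum of searching and training time, rounded to two decimals."""
    searching, training = times[method][dataset]
    return round(searching + training, 2)


totals = {method: {d: search_plus_train(method, d) for d in ("IN", "UP", "HU")}
          for method in times}

for method, per_dataset in totals.items():
    print(method, per_dataset)
```

With the reported numbers, EL-NAS searches faster but trains slightly slower than Auto-CNN, and its summed time is lower on all three data sets.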

Table 9. Classification results of cross-datasets architecture search of EL-NAS.

Evaluate Data	SA 10		SA 20		IN 10		IN 20
Search Data	SA 10	IN 10%	SA 20	IN 10%	IN 10	SA 10%	IN 20	SA 10%
OA(%)	94.10	94.70	95.55	95.99	87.95	88.60	90.00	90.39
AA(%)	96.00	96.25	96.80	96.73	87.90	88.51	86.30	86.41
KAPPA(%)	94.20	94.07	95.60	95.50	86.80	86.94	88.10	88.97

Table 10. Classification results of cross-sensors architecture search of EL-NAS.

Evaluate Data	IMDB 10		IMDB 20		IN 10		IN 20		UP 10		UP 20		SA 10		SA 20
Search Data	IMDB 10	HU 10%	IMDB 20	HU 10%	IN 10	HU 10%	IN 20	HU 10%	UP 10	HU 10%	UP 20	HU 10%	SA 10	HU 10%	SA 20	HU 10%
OA(%)	97.1	97.8	99.0	99.3	86.9	88.7	90.2	89.0	91.5	91.7	91.9	92.1	94.0	95.8	96.7	96.5
AA(%)	95.2	96.4	97.3	97.6	85.8	87.6	87.3	86.1	88.8	87.2	87.8	88.5	97.0	96.9	97.2	98.1
KAPPA(%)	96.9	97.8	98.9	99.2	84.9	87.1	89.3	87.7	87.9	89.0	89.1	90.2	93.7	95.1	94.4	95.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

