1. Introduction
In recent years, with the rapid development of remote sensing image sensor technology, the data volume and spatial resolution of high-resolution remote sensing images (HRRSIs) have increased greatly [1,2,3,4]. However, differences in sensor types, imaging and illumination conditions, and shooting heights and angles lead to large differences in the distribution of HRRSIs [5,6,7]. Because of these distribution differences, a trained model generalizes poorly when applied to new data samples. How to improve the generalization ability of models across differently distributed datasets has therefore become a significant issue in current research [8,9,10].
To alleviate the weak generalization ability of such models, scholars have proposed cross-domain scene classification. These models classify label-sparse data (the target domain) based on knowledge learned from label-rich data (the source domain), where the source and target domains follow different distributions [11,12]. Domain adaptation is one of the most widely used approaches to cross-domain scene classification [13,14]. It maps different scenes to a common feature space [15] and assumes that the source and target domains share the same category space but have different data probability distributions (domain shift) [13,16]. In practical applications, however, it is difficult to find a source domain that covers all categories of the target domain [17].
The most straightforward way to alleviate domain shift is to transform the source and target domains so that their data distributions become closer. According to the characteristics of the data distribution, such methods mainly include (1) conditional distribution adaptation [18]; (2) marginal distribution adaptation [19,20]; and (3) joint distribution adaptation [21]. Othman et al. [13] proposed small-batch gradient-based dynamic sample optimization to reduce the difference between the marginal and conditional distributions. Adversarial learning is another common domain adaptation method, which reduces domain shift through a minimax game between a generator and a discriminator [22]. Shen et al. [23] proposed adversarial learning based on the Wasserstein distance to learn domain-invariant features. Yu et al. [24] proposed an adaptation model based on dynamic adversarial learning, which uses the A-distance to set the weights of the marginal and conditional distributions.
The successful models mentioned above provide a solid theoretical basis for our research and have achieved impressive performance. However, for remote sensing cross-domain scene classification, they still face the following challenges:
As shown in Figure 1, because HRRSIs are generated under diverse conditions (different heights, scales, seasons, and sensors), the features of a single layer (such as high-level semantic features) can hardly cover all the feature information in HRRSIs. In other words, most existing models rely mainly on the feature information of a certain layer; they struggle to capture all the feature information in HRRSIs and may therefore learn only part of the domain-invariant features.
As shown in Figure 2, in some scenes, such as residential and river scenes, most existing models have difficulty enhancing the weight of key regions; they easily focus on backgrounds and cause negative transfer.
Most existing models treat the marginal and conditional distributions as equally important and do not distinguish between them; this view has been shown to be one-sided and insufficient [24]. Although some scholars have recognized this problem, most proposed methods set the parameters of these distributions manually, which may fall into a local optimum. In addition, some models align both local and global features to obtain better results, but the need to manually adjust the weights of these two parts increases computational difficulty and time consumption.
To address these challenges, we propose a novel remote sensing cross-domain scene classification model based on Lie group spatial attention and adaptive multi-feature distribution. The model fully considers the representation of multi-feature spaces and expands the space of domain-invariant features. Its attention mechanism effectively enhances the weight of key regions and suppresses the weight of irrelevant features, such as backgrounds, and the model dynamically adjusts the parameters of the marginal and conditional distributions.
The main contributions of this study are as follows:
To address the problem that a limited feature representation cannot learn sufficient features in HRRSIs, we propose a multi-feature space representation module based on Lie group space, which projects HRRSIs into the Lie group space and extracts features at different levels (low-level, middle-level, and high-level) and different scales, effectively expanding the space of domain-invariant features.
To address the problem of negative transfer, we design an attention mechanism based on dynamic feature fusion alignment, which effectively enhances the weight of key regions, makes domain-invariant features more adaptable and transferable, and further improves the robustness of the model.
To address the imbalance between the relative importance of the marginal and conditional distributions, and the problem that manual parameter setting may lead to a local optimum, the proposed method accounts for the importance of both distributions and adjusts their parameters dynamically, effectively avoiding manual parameter setting and further improving the reliability of the model.
2. Method
As shown in Figure 3, this section introduces our proposed model in detail, covering the problem description, domain feature extraction, the attention mechanism, and the dynamic setting of parameters.
2.1. Problem Description
$\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ represents the source domain containing $n_s$ labeled samples, where $x_i^s$ represents the $i$th sample and $y_i^s$ represents its corresponding category information. $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ represents the target domain containing $n_t$ unlabeled samples, where $x_j^t$ represents the $j$th sample and $y_j^t$ represents its corresponding unknown category information. The category space and feature space of the source and target domains are the same, i.e., $\mathcal{Y}_s = \mathcal{Y}_t$ and $\mathcal{X}_s = \mathcal{X}_t$, but their marginal probability distributions and conditional probability distributions differ, i.e., $P_s(x^s) \neq P_t(x^t)$ and $Q_s(y^s \mid x^s) \neq Q_t(y^t \mid x^t)$. The goal of our model is to reduce the difference between the source and target domains by learning domain-invariant features from the source domain data samples.
2.2. Domain Feature Extractor
This subsection mainly includes two modules: Lie group feature learning and multi-feature representation.
2.2.1. Lie Group Feature Learning
In our previous research, we proposed Lie group feature learning [1,2,3,4]. In addition, we draw on some approaches in the literature [25,26]. As shown in Figure 3, the previously proposed method is used in this study to extract and learn the low-level and middle-level features of the data samples.
Firstly, each sample is mapped to the manifold space of the Lie group to obtain the corresponding data sample in the Lie group space, where $x_i^j$ represents the $i$th data sample of the $j$th class in the dataset, and $\tilde{x}_i^j$ represents the $i$th data sample of the $j$th class in the Lie group space.
Then, we perform feature extraction on the data samples in the Lie group space. The first eight features mainly extract basic information, such as coordinates and colors. Through previous research, we found that although target objects differ in shape and size, their positions are similar; besides target coordinates, color is also an important feature, for example, in forest scenes. At the same time, we consider the influence of different illumination conditions and add the Y (luminance) feature, together with two further illumination-related features, to enhance the representation ability of the low-level features. The last three features mainly extract middle-level information: one focuses on the texture and detail information in the scene, one has the advantage of being invariant to monotonic illumination changes, and one can simulate the single-cell receptive field of the cerebral cortex and extract spatial orientation and related information in the scene. The content related to Lie group machine learning can be found in our previous research [1,2,3,4,27,28,29].
In terms of high-level feature learning, the approach shown in Figure 4 is utilized. It consists of four parallel dilated convolution modules, each followed by switchable whitening (SW) and a scaled exponential linear unit (SeLU) activation function. Traditional convolution is not used here because, in previous research [1], we found that parallel dilated convolution can effectively expand the receptive field and learn more semantic information than traditional convolution while using fewer parameters; for details, please refer to our previous research [1]. The SW [30] method combines a variety of normalization and whitening methods, among which whitening can effectively reduce the pixel-to-pixel correlation of HRRSIs, which is conducive to feature alignment. The SW used in this study includes batch normalization (BN), batch whitening (BW), instance whitening (IW), and layer normalization (LN), which together extract more discriminative features. In addition, in a previous study [1], we found that the traditional rectified linear unit (ReLU) activation function outputs zero over the entire negative half-axis, which may cause gradients to vanish during the model training phase. Therefore, we adopted the SeLU activation function based on that study.
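As a rough illustration of this design, the following PyTorch sketch builds four parallel dilated convolution branches, each followed by normalization and SeLU. The dilation rates, channel sizes, and the use of BatchNorm2d in place of switchable whitening are simplifying assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DilatedBranch(nn.Module):
    """One branch: dilated conv -> normalization -> SeLU.

    Note: the paper uses switchable whitening (SW); BatchNorm2d is a
    simplified stand-in here.
    """
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # padding = dilation keeps the spatial size for a 3x3 kernel
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.SELU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class ParallelDilatedConv(nn.Module):
    """Four parallel dilated branches whose outputs are concatenated.

    The dilation rates (1, 2, 3, 4) are illustrative assumptions.
    """
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            DilatedBranch(in_ch, out_ch, d) for d in dilations)

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)
```

Because every branch preserves the spatial size, the concatenated output simply multiplies the per-branch channel count by the number of branches.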
2.2.2. Multi-Feature Representation
In traditional models, fixed-size convolutions are usually stacked, so the receptive field of the resulting feature map is small, and key feature information in HRRSIs may be lost. To address this problem, in previous research [4], we proposed the multidilation pooling module, which contains four branches: the first branch directly applies global average pooling, and the other three branches adopt dilation rates of 2, 5, and 6, after which the obtained features are concatenated. To improve the feature representation ability further, we optimize and extend that module in this study to explore the spatial scope of domain-invariant features more fully.
The structure of the multi-feature space representation is shown in Figure 5, and the specific operations are as follows: (1) to reduce the dimensions of the features and improve the computational performance of the model, a parallel dilated convolution is adopted; (2) to extract the range of domain-invariant features more effectively, three different dilation rates and SW are used, since different dilation rates can effectively capture the diversity of the feature space; (3) the SeLU activation function is used to ensure the nonlinear mapping of the model; and (4) the features obtained above are fused through a concatenation operation, the dimension of the feature map is restored to the original dimension by using a parallel dilated convolution, and a residual connection is used to obtain the final representation.
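The four steps above can be sketched as follows in PyTorch. The kernel sizes (1×1 for reduction and restoration, 3×3 for the dilated branches), the dilation rates 2, 5, and 6 (borrowed from the earlier multidilation module), and BatchNorm2d standing in for SW are all assumptions:

```python
import torch
import torch.nn as nn

class MultiFeatureRepresentation(nn.Module):
    """Sketch of the multi-feature space representation block.

    Steps: (1) channel reduction, (2) three dilated branches with
    normalization, (3) SeLU nonlinearity, (4) concatenation, channel
    restoration, and a residual connection.
    """
    def __init__(self, channels, reduced=None, rates=(2, 5, 6)):
        super().__init__()
        reduced = reduced or channels // 2
        self.reduce = nn.Conv2d(channels, reduced, 1)      # step (1)
        self.branches = nn.ModuleList(                     # step (2)
            nn.Sequential(
                nn.Conv2d(reduced, reduced, 3, padding=r, dilation=r),
                nn.BatchNorm2d(reduced),                   # SW stand-in
                nn.SELU())                                 # step (3)
            for r in rates)
        self.restore = nn.Conv2d(len(rates) * reduced, channels, 1)

    def forward(self, x):
        z = self.reduce(x)
        z = torch.cat([b(z) for b in self.branches], dim=1)  # step (4)
        return x + self.restore(z)                         # residual
```

The residual connection lets the block refine the input representation without discarding it, which matches the stated goal of expanding rather than replacing the feature space.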
2.3. Alignment Attention Mechanism Based on Dynamic Feature Fusion
2.3.1. Dynamic Feature Fusion Alignment
To effectively alleviate the difference between the source domain and the target domain and find the optimal balance point between the two domains, a dynamic feature-fusion-based alignment attention mechanism is proposed in this subsection, as shown in Figure 6. The overall weight is obtained by fusing a dynamic alignment weight, which is updated during training, with a static weight.
2.3.2. Three-Dimensional Spatial Attention Mechanism
The three-dimensional spatial attention mechanism operates over three dimensions (height, width, and channel) and realizes the interaction among them. The preceding result is passed through a parallel dilated convolution, and three replicas of the output are made. In the first branch, height, the feature is rotated 90 degrees along the H-axis. To obtain the attention weights in this dimension, we first retain rich features by applying both average pooling and max pooling to the input feature.
The pooled feature then passes through parallel dilated convolution layers, SW, and the SeLU activation function in turn; finally, it is rotated 90 degrees along the H-axis to restore the shape of the original feature map. The operation in the second branch, width, is similar, except that the rotation is 90 degrees along the W-axis. In the third branch, channel, the feature undergoes pooling, parallel dilated convolution layers, and SeLU activation. After the weights of the three dimensions are obtained, they are aggregated to produce the final attention output.
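A simplified PyTorch sketch of the three-branch interaction follows. The rotations are implemented as tensor permutes, plain Conv2d replaces the parallel dilated convolution and SW, and the sigmoid gating and averaging aggregation are assumptions of this sketch, since the paper's exact formulas are not reproduced here:

```python
import torch
import torch.nn as nn

def pool_hw(x):
    """Concatenate average- and max-pooling along the channel axis."""
    return torch.cat([x.mean(dim=1, keepdim=True),
                      x.amax(dim=1, keepdim=True)], dim=1)

class ThreeDimAttention(nn.Module):
    """Sketch of height/width/channel attention with axis rotations."""
    def __init__(self, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv_h = nn.Conv2d(2, 1, kernel_size, padding=pad)
        self.conv_w = nn.Conv2d(2, 1, kernel_size, padding=pad)
        self.conv_c = nn.Conv2d(2, 1, kernel_size, padding=pad)

    def forward(self, x):                       # x: (B, C, H, W)
        # Height branch: rotate so H takes the channel axis.
        xh = x.permute(0, 2, 1, 3)              # (B, H, C, W)
        wh = torch.sigmoid(self.conv_h(pool_hw(xh)))
        yh = (xh * wh).permute(0, 2, 1, 3)      # rotate back
        # Width branch: rotate so W takes the channel axis.
        xw = x.permute(0, 3, 2, 1)              # (B, W, H, C)
        ww = torch.sigmoid(self.conv_w(pool_hw(xw)))
        yw = (xw * ww).permute(0, 3, 2, 1)
        # Channel branch: pool over channels directly.
        wc = torch.sigmoid(self.conv_c(pool_hw(x)))
        yc = x * wc
        return (yh + yw + yc) / 3.0             # assumed aggregation
```

Each branch reweights the input along one axis and restores the original layout, so the three outputs can be combined element-wise.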
2.4. Discrepancy Similarity Calculation
To address the marginal and conditional distributions efficiently, in this subsection, we propose the Lie group maximum mean discrepancy (LGMMD) and the Lie group conditional maximum mean discrepancy (LGCMMD).
2.4.1. LGMMD
Maximum mean discrepancy (MMD) is one of the typical methods of calculating the discrepancy between distributions, mainly by measuring their distance in a reproducing kernel Hilbert space (RKHS). The traditional MMD is calculated as follows:

$$\mathrm{MMD}(\mathcal{D}_s, \mathcal{D}_t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \right\|_{\mathcal{H}}^2$$

where $\mathcal{H}$ represents the RKHS established by the feature mapping, and $\phi(\cdot)$ represents the feature mapping function; that is, MMD compares the averages of the samples from the two distributions.
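The kernel-trick expansion of this mean discrepancy can be computed directly. The following NumPy sketch implements the standard (Euclidean) MMD with an RBF kernel, i.e., without the Lie group intrinsic means used by LGMMD; the function names are illustrative:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.clip(sq, 0, None))

def mmd2(xs, xt, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and xt.

    Expands ||mean phi(xs) - mean phi(xt)||^2 in the RKHS into three
    kernel averages (the standard kernel trick).
    """
    return (rbf_kernel(xs, xs, gamma).mean()
            - 2 * rbf_kernel(xs, xt, gamma).mean()
            + rbf_kernel(xt, xt, gamma).mean())
```

Two samples drawn from the same distribution yield a value near zero, while a distribution shift between them increases the estimate.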
To calculate the marginal distribution discrepancy between the source domain and the target domain, we optimize and improve MMD by replacing the sample means with the Lie group intrinsic means of the source and target domains, respectively. In previous studies [27,28], we found that the Lie group intrinsic mean can identify the potential characteristics of data samples; the specific calculation method can be found in [27,28].
2.4.2. LGCMMD
Although LGMMD effectively reduces the marginal distribution divergence, the conditional distribution divergence cannot be ignored, so we must consider it as well. Since the target domain contains no labeled samples, it is difficult to estimate its conditional distribution directly. The usual solution is to adopt the predicted values of the target domain data as pseudo labels.
The posterior probabilities of the source and target domains (i.e., $Q_s(y^s \mid x^s)$ and $Q_t(y^t \mid x^t)$) are rather difficult to represent and are generally approximated by the sufficient statistics of the class conditionals, namely, $Q_s(x^s \mid y^s)$ and $Q_t(x^t \mid y^t)$. Therefore, LGCMMD can be expressed as follows:

$$\mathrm{LGCMMD} = \frac{1}{C} \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{x_i^s \in \mathcal{D}_s^{(c)}} \phi(x_i^s) - \frac{1}{n_t^{(c)}} \sum_{x_j^t \in \mathcal{D}_t^{(c)}} \phi(x_j^t) \right\|_{\mathcal{H}}^2$$

where $C$ represents the number of classes of data samples, and $\mathcal{D}_s^{(c)}$ and $\mathcal{D}_t^{(c)}$ denote the source samples labeled, and the target samples pseudo-labeled, with class $c$.
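For illustration, the class-wise averaging with pseudo-labels can be sketched as follows. A linear-kernel MMD (the squared difference of per-class means) is used for brevity, rather than the Lie group intrinsic means and kernel mapping of LGCMMD, and the function name is illustrative:

```python
import numpy as np

def class_conditional_mmd2(xs, ys, xt, yt_pseudo, num_classes):
    """Average per-class squared mean discrepancy using pseudo-labels.

    Linear-kernel simplification: per class c, compute
    ||mean(xs[ys == c]) - mean(xt[yt_pseudo == c])||^2 and average.
    Classes absent from either domain are skipped.
    """
    total, counted = 0.0, 0
    for c in range(num_classes):
        s_c = xs[ys == c]
        t_c = xt[yt_pseudo == c]
        if len(s_c) == 0 or len(t_c) == 0:
            continue  # cannot compare a class with no samples
        diff = s_c.mean(axis=0) - t_c.mean(axis=0)
        total += float(diff @ diff)
        counted += 1
    return total / max(counted, 1)
```

Because the target labels are pseudo-labels, the estimate improves over training as the classifier's target predictions become more reliable.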
2.4.3. Dynamic Tradeoff Parameter
The discrepancies of the two distributions can be obtained through the above calculations, and how to set the weights of the two distributions is a key problem in this research. In previous research, we found that average search and random guessing are the commonly used methods [24]. Although both have been widely used in many models, they are relatively inefficient.
To address these problems, we propose a dynamic tradeoff parameter that is updated in each iteration; when training converges, a relatively stable parameter value is obtained.
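The following pure-Python snippet illustrates one plausible way such a per-iteration weight could be computed. It is a loss-ratio heuristic of our own for illustration, not the paper's formula; [24], for example, derives a similar weight from A-distances instead:

```python
def dynamic_tradeoff(loss_marginal, loss_conditional, eps=1e-8):
    """Illustrative update: weight whichever distribution currently
    shows the larger discrepancy more heavily.

    Returns mu, the weight for the conditional term; (1 - mu) then
    weights the marginal term. This is an assumed heuristic, not the
    method proposed in the paper.
    """
    return loss_conditional / (loss_marginal + loss_conditional + eps)
```

Recomputing the weight each iteration lets the balance between the two distributions track the training dynamics instead of being fixed by hand.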
2.5. Loss Function
The marginal distribution adaptation and conditional distribution adaptation loss functions are as follows:

$$\mathcal{L}_{mar} = \mathrm{LGMMD}\big(F(\mathcal{D}_s), F(\mathcal{D}_t)\big), \qquad \mathcal{L}_{con} = \mathrm{LGCMMD}\big(F(\mathcal{D}_s), F(\mathcal{D}_t)\big),$$

respectively, where $F$ represents the domain feature extractor.
The class classifier is used to determine the category of the input data, and the corresponding loss function is expressed as follows:

$$\mathcal{L}_{cls} = \frac{1}{n_s}\sum_{i=1}^{n_s} J\big(G(F(x_i^s)), y_i^s\big)$$

where $J(\cdot,\cdot)$ represents the cross-entropy loss, and $G$ represents the category classifier.
In summary, the overall objective function is expressed as follows:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\big((1-\mu)\,\mathcal{L}_{mar} + \mu\,\mathcal{L}_{con}\big)$$

where $\lambda$ denotes the non-negative tradeoff parameter, and $\mu$ is the dynamic tradeoff parameter introduced in Section 2.4.3.
5. Conclusions
In this study, we proposed a novel remote sensing cross-domain scene classification model based on Lie group spatial attention and adaptive multi-feature distribution. We tackled the problem of insufficient feature learning in traditional models by extracting low-level, middle-level, and high-level features. We further optimized the multi-scale feature space representation based on our previous research, effectively expanding the space of domain-invariant features. We also designed attention mechanisms for different spatial dimensions, focusing on key regions through model training while suppressing irrelevant features. The experimental results indicated that our proposed method has advantages in terms of model accuracy and the number of model parameters. Our method is also able to automatically adjust the parameters of the marginal and conditional distributions, which greatly improves the effectiveness and robustness of the model.
In our study, we mainly considered the characteristics of the source domain and the target domain. In future research, we will therefore explore the use of other data (such as Gaode map data) to further investigate cross-domain scene classification. We will also continue to explore the integration of Lie group machine learning and deep learning models to improve the robustness, interpretability, and comprehensibility of cross-domain scene classification models.