Semi-Supervised Subcategory Centroid Alignment-Based Scene Classification for High-Resolution Remote Sensing Images †
Abstract
1. Introduction
- The proposed RCFE incorporates rotation robustness into the convolutional feature extractor by taking both rotation-invariant HOG images and the original images as input, which reduces the impact of spectral shift and rotation variance on feature extraction (a minimal illustrative sketch follows this list).
- We propose the NSCA method, which moves target features toward the relevant subcategories of the corresponding source-domain features in order to reduce the deviation between the feature distributions of the two domains.
- The proposed SSCA framework, combining RCFE and NSCA, achieves classification accuracy better than that of most existing methods on the two test datasets.
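To make the dual-input idea in RCFE concrete, here is a minimal sketch that pairs each scene image with an approximately rotation-invariant HOG image before both are fed to a convolutional feature extractor. The paper builds its rotation-invariant HOG images through Fourier analysis in polar coordinates (see the HOG references in the bibliography); the brute-force rotation averaging below, together with every function name and parameter choice, is an illustrative assumption rather than the paper's implementation.

```python
# Illustrative sketch only: approximate rotation-invariant HOG image paired with the
# original image as a dual input for a convolutional feature extractor.
# Assumptions: grayscale conversion by channel averaging, 8 rotation angles, and
# skimage's default-style HOG parameters; the paper instead uses Fourier-based
# rotation-invariant HOG (Skibbe & Reisert 2012; Liu et al. 2014).
import numpy as np
from skimage.feature import hog
from skimage.transform import rotate


def rotation_averaged_hog_image(gray, angles=(0, 45, 90, 135, 180, 225, 270, 315)):
    """Average HOG visualizations over rotated copies (rotated back afterwards)
    to suppress the orientation dependence of a plain HOG image."""
    accum = np.zeros(gray.shape, dtype=float)
    for a in angles:
        rotated = rotate(gray, angle=a, preserve_range=True)
        _, hog_img = hog(rotated, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2), visualize=True)
        accum += rotate(hog_img, angle=-a, preserve_range=True)
    return accum / len(angles)


def build_dual_input(rgb):
    """Stack the original image (H x W x 3) with its approximate rotation-invariant
    HOG image as a fourth channel, so one extractor sees both inputs."""
    gray = rgb.mean(axis=2)
    hog_img = rotation_averaged_hog_image(gray)
    return np.concatenate([rgb.astype(float), hog_img[..., None]], axis=2)
```

Stacking the HOG image as an extra channel is only one way of exposing the extractor to both inputs; the RCFE described in Section 2.2 may combine them differently.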
2. Materials and Methods
2.1. Generating Rotation-Invariant HOG Images
2.2. Rotation-Robust Convolutional Feature Extractor
2.3. Neighbor-Based Subcategory Centroid Alignment
Algorithm 1 NSCA approach description
1: Input: target features, target labels, source features, source labels, category number C, nearest-neighbor number M, subcategory number k.
2: Output: target features after moving.
3: The source features of all categories are divided into k × C subcategories with k-means, so that each category contains k subcategories. The subcategory that a source or target image belongs to is taken as its label.
4: While the predictions have not converged do
5: A classifier over the k × C subcategories is trained on the features and subcategory labels obtained in step 3.
6: In the first iteration (l = 1), the label of each target feature is predicted by the trained classifier.
7: The subcategory centroids are estimated from the features and the corresponding labels.
8: The quantities in Equations (6)–(8) are calculated for each subcategory.
9: The M nearest neighbors of each target feature are found, and the moving direction is calculated by Equation (9).
10: Each target feature is moved along the computed direction.
11: The moved target features are re-predicted by the classifier trained in step 5.
12: The predicted labels are updated for iteration l + 1.
13: End while
14: Return the moved target features.
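Since Equations (6)–(9) are not reproduced in this outline, the sketch below is a heavily simplified rendering of the loop in Algorithm 1: each source category is split into k subcategories with k-means, a subcategory classifier is trained, and the target features are moved iteratively until their predicted subcategory labels stop changing. Moving each feature toward the centroid of its predicted subcategory (instead of the neighbor-weighted direction of Equation (9)), the step size alpha, training the classifier on source features alone, and all symbol names are assumptions introduced for illustration.

```python
# Simplified NSCA sketch (not the paper's exact update): k-means subcategories,
# a subcategory classifier, and iterative movement of target features toward
# source subcategory centroids until the predicted labels converge.
# Assumes every category appears in y_s with at least k samples; the step size
# alpha, the centroid-based direction, and the classifier choice are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def nsca_sketch(X_s, y_s, X_t, C, k=3, alpha=0.5, max_iter=20):
    # Step 3: split each source category into k subcategories with k-means,
    # yielding k * C subcategory labels for the source features.
    sub_labels = np.empty(len(X_s), dtype=int)
    for c in range(C):
        idx = np.where(y_s == c)[0]
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_s[idx])
        sub_labels[idx] = c * k + km.labels_

    # Step 5: train a classifier over the k * C subcategories.
    clf = LogisticRegression(max_iter=1000).fit(X_s, sub_labels)

    # Steps 4-13: move target features until the predicted subcategory labels converge.
    X_moved = X_t.astype(float)
    prev_pred = None
    for _ in range(max_iter):
        pred = clf.predict(X_moved)                        # steps 6 and 11
        centroids = np.stack([X_s[sub_labels == s].mean(axis=0)
                              for s in range(k * C)])      # step 7
        # Steps 8-10 (stand-in): step toward the predicted subcategory centroid.
        X_moved += alpha * (centroids[pred] - X_moved)
        if prev_pred is not None and np.array_equal(pred, prev_pred):
            break                                          # convergence test (step 4)
        prev_pred = pred
    return X_moved                                         # step 14
```

A call such as `nsca_sketch(X_s, y_s, X_t, C=8, k=3)` returns target features pulled toward the source subcategory structure before the final scene classification.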
3. Results
3.1. Dataset Partition and Description
3.2. Experimental Setup
3.3. Comparison Experiment
3.4. Ablation Experiment
4. Discussion
4.1. Confusion Analysis
4.2. Feature Visualization
4.3. Sensitivity Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
- Adegun, A.A.; Viriri, S.; Tapamo, J.R. Review of deep learning methods for remote sensing satellite images classification: Experimental survey and comparative analysis. J. Big Data 2023, 10, 93.
- Zhang, Q.; Yuan, Q.; Song, M.; Yu, H.; Zhang, L. Cooperated spectral low-rankness prior and deep spatial prior for HSI unsupervised denoising. IEEE Trans. Image Process. 2022, 31, 6356–6368.
- Zhang, Q.; Zheng, Y.; Yuan, Q.; Song, M.; Yu, H.; Xiao, Y. Hyperspectral image denoising: From model-driven, data-driven, to model-data-driven. IEEE Trans. Neural Netw. Learn. Syst. 2023, 6, 1–21.
- Thapa, A.; Horanont, T.; Neupane, B.; Aryal, J. Deep learning for remote sensing image scene classification: A review and meta-analysis. Remote Sens. 2023, 15, 4804.
- Qiao, H.; Qian, W.; Hu, H.; Huang, X.; Li, J. Semi-Supervised Building Extraction with Optical Flow Correction Based on Satellite Video Data in a Tsunami-Induced Disaster Scene. Sensors 2024, 24, 5205.
- Liu, K.; Yang, J.; Li, S. Remote-Sensing Cross-Domain Scene Classification: A Dataset and Benchmark. Remote Sens. 2022, 14, 4635.
- Tuia, D.; Persello, C.; Bruzzone, L. Domain Adaptation for the Classification of Remote Sensing Data: An Overview of Recent Advances. IEEE Geosci. Remote Sens. Mag. 2016, 4, 41–57.
- Yan, L.; Zhu, R.; Mo, N.; Liu, Y. Cross-Domain Distance Metric Learning Framework with Limited Target Samples for Scene Classification of Aerial Images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3840–3857.
- Yang, C.; Dong, Y.; Du, B.; Zhang, L. Attention-Based Dynamic Alignment and Dynamic Distribution Adaptation for Remote Sensing Cross-Domain Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5634713.
- Li, Y.; Li, Z.; Su, A.; Wang, K.; Wang, Z.; Yu, Q. Semi-Supervised Cross-Domain Remote Sensing Scene Classification via Category-Level Feature Alignment Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621614.
- Bahirat, K.; Bovolo, F.; Bruzzone, L.; Chaudhuri, S. A Novel Domain Adaptation Bayesian Classifier for Updating Land-Cover Maps with Class Differences in Source and Target Domains. IEEE Trans. Geosci. Remote Sens. 2012, 50, 2810–2826.
- Wei, H.; Ma, L.; Liu, Y.; Du, Q. Combining Multiple Classifiers for Domain Adaptation of Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1832–1847.
- Zheng, Z.; Zhong, Y.; Su, Y.; Ma, A. Domain Adaptation via a Task-Specific Classifier Framework for Remote Sensing Cross-Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620513.
- Zhu, R.; Yan, L.; Mo, N.; Liu, Y. Semi-supervised center-based discriminative adversarial learning for cross-domain scene-level land-cover classification of aerial images. ISPRS J. Photogramm. Remote Sens. 2019, 155, 72–89.
- Zhao, X.; Zhang, M.; Tao, R.; Li, W.; Liao, W.; Philips, W. Cross-Domain Classification of Multisource Remote Sensing Data Using Fractional Fusion and Spatial-Spectral Domain Adaptation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5721–5733.
- Zhu, S.; Wu, C.; Du, B.; Zhang, L. Style and content separation network for remote sensing image cross-scene generalization. ISPRS J. Photogramm. Remote Sens. 2023, 201, 1–11.
- Ye, M.; Qian, Y.; Zhou, J.; Yuan, Y. Dictionary Learning-Based Feature-Level Domain Adaptation for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 1544–1562.
- Patel, V.M.; Gopalan, R.; Li, R. Visual Domain Adaptation: An Overview of Recent Advances. IEEE Signal Process. Mag. 2015, 32, 53–69.
- Tuia, D.; Volpi, M.; Trolliet, M.; Camps-Valls, G. Semi-supervised Manifold Alignment of Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2014, 52, 7708–7720.
- Matasci, G.; Volpi, M.; Kanevski, M.; Bruzzone, L.; Tuia, D. Semisupervised Transfer Component Analysis for Domain Adaptation in Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3550–3564.
- Zhu, L.; Ma, L. Class centroid alignment based domain adaptation for classification of remote sensing images. Pattern Recognit. Lett. 2016, 83, 124–132.
- Ojala, T.; Pietikäinen, M.; Mäenpää, T. Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
- Skibbe, H.; Reisert, M. Circular Fourier-HOG features for rotation invariant object detection in biomedical images. In Proceedings of the IEEE International Symposium on Biomedical Imaging, Barcelona, Spain, 2–5 May 2012; pp. 450–453.
- Liu, K.; Skibbe, H.; Schmidt, T.; Blein, T.; Palme, K.; Brox, T.; Ronneberger, O. Rotation-Invariant HOG Descriptors Using Fourier Analysis in Polar and Spherical Coordinates. Int. J. Comput. Vis. 2014, 106, 342–364.
- Gong, C.; Yang, C.; Yao, X.; Lei, G.; Han, J. When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821.
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025.
- Laptev, D.; Savinov, N.; Buhmann, J.M.; Pollefeys, M. TI-Pooling: Transformation-invariant pooling for feature learning in Convolutional Neural Networks. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 289–297.
- Zhou, Y.; Ye, Q.; Qiang, Q.; Jiao, J. Oriented Response Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4961–4970.
- Cohen, T.S.; Welling, M. Group Equivariant Convolutional Networks. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 2990–2999.
- Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107.
- Volpi, M.; Camps-Valls, G.; Tuia, D. Spectral alignment of multi-temporal cross-sensor images with automated kernel canonical correlation analysis. ISPRS J. Photogramm. Remote Sens. 2015, 107, 50–63.
- Li, X.; Zhang, L.; Du, B.; Zhang, L.; Shi, Q. Iterative reweighting heterogeneous transfer learning framework for supervised remote sensing image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 2022–2035.
- Sun, H.; Liu, S.; Zhou, S.; Zou, H. Transfer sparse subspace analysis for unsupervised cross-view scene model adaptation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2901–2909.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Gong, C.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883.
- Li, H.; Dou, X.; Tao, C.; Wu, Z.; Chen, J.; Peng, J.; Deng, M.; Zhao, L. RSI-CB: A Large-Scale Remote Sensing Image Classification Benchmark Using Crowdsourced Data. Sensors 2020, 20, 1594.
- Yi, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the Sigspatial International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
- Zhong, Y.; Zhu, Q.; Zhang, L. Scene Classification Based on the Multifeature Fusion Probabilistic Topic Model for High Spatial Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6207–6222.
- Qiang, Q.; Patel, V.M.; Turaga, P.; Chellappa, R. Domain Adaptive Dictionary Learning. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 631–645.
- Lu, B.; Chellappa, R.; Nasrabadi, N.M. Incremental Dictionary Learning for Unsupervised Domain Adaptation. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 108.1–108.12.
- Ammour, N.; Bashmal, L.; Bazi, Y.; Rahhal, M.A.; Zuair, M. Asymmetric Adaptation of Deep Features for Cross-Domain Classification in Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2018, 15, 597–601.
- Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176.
- Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 1640–1650.
- Zhang, W.; Ouyang, W.; Li, W.; Xu, D. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3801–3809.
- AlRahhal, M.; Bazi, Y.; AlHichri, H.; Alajlan, N.; Melgani, F.; Yager, R.R. Deep learning approach for active classification of electrocardiogram signals. Inf. Sci. 2016, 345, 340–354.
- Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35.
Class | Training Dataset (NWPU-RESISC45) | Training Dataset (RSI-CB256) | Validation Dataset (UC Merced) | Validation Dataset (SIRI-WHU) | Testing Dataset (UC Merced) | Testing Dataset (SIRI-WHU) |
---|---|---|---|---|---|---|
airport | 700 | 351 | 20 | ✕ | 80 | ✕ |
baseball | 700 | ✕ | 20 | ✕ | 80 | ✕ |
beach | 700 | ✕ | 20 | ✕ | 80 | ✕ |
buildings | ✕ | 1014 | 20 | ✕ | 80 | ✕ |
chaparral | 700 | ✕ | 20 | ✕ | 80 | ✕ |
dense residential | 700 | ✕ | 20 | ✕ | 80 | ✕ |
farmland | 700 | 644 | 20 | 512 | 80 | 1549 |
forest | 700 | 1082 | 20 | 286 | 80 | 1148 |
freeway | 700 | 223 | 20 | 105 | 80 | 420 |
golf course | 700 | ✕ | 20 | ✕ | 80 | ✕ |
harbor | 700 | ✕ | 20 | ✕ | 80 | ✕ |
intersection | 700 | ✕ | 20 | ✕ | 80 | ✕ |
medium residential | 700 | ✕ | 20 | 271 | 80 | 1084 |
mobile home park | 700 | ✕ | 20 | ✕ | 80 | ✕ |
overpass | 700 | ✕ | 20 | ✕ | 80 | ✕ |
parking lot | 700 | 467 | 20 | 45 | 80 | 182 |
river | 700 | 539 | 20 | 13 | 80 | 52 |
runway | 700 | ✕ | 20 | ✕ | 80 | ✕ |
sparse residential | 700 | ✕ | 20 | ✕ | 80 | ✕ |
storage tank | 700 | 1307 | 20 | ✕ | 80 | ✕ |
tennis court | 700 | ✕ | 20 | ✕ | 80 | ✕ |
Types of Methods | Methods | Experimental Parameter Settings |
---|---|---|
Data distribution adaptation methods | SSCA | for UC Merced dataset; for SIRI-WHU dataset. |
 | DADL | Sparsity level T = 0.4, tradeoff parameter , codebook size s = 1300, stopping threshold 0.9. |
 | IDL | Tradeoff parameter and normalization parameter , codebook size s = 1300, number of supportive samples Q = 50. |
 | CCA | Number of nearest neighbors , the SVM parameters are the same as those in SSCA. |
 | AADF | 256-dimension features by the DAE network in [46], dropout value 0.5, learning rate 0.1, momentum 0.5, regularization parameter 0.5, batch sizes [100, 80, 60, 40, 20, 10]. |
Adversarial domain adaptation methods | SCDAL | . |
 | CADA | Batch size 128; learning rate and momentum are the same as in the domain-adversarial neural network (DANN) [47]. |
 | CAN | Initial learning rate 0.0015, decreased gradually after each iteration, as in DANN. Weight decay, momentum, and batch size are 3 × 10−4, 0.9, and 128. |
 | ADDA | Batch size 128, maximum iterations 20,000, learning rate 1 × 10−4. |
Method | UC Merced (Accuracy) | SIRI-WHU (Accuracy)
---|---|---
The proposed SSCA | 0.9314 | 0.9177 |
SCDAL | 0.9118 | 0.8958 |
ADDA | 0.8723 | 0.8617 |
CADA | 0.8938 | 0.8850 |
CAN | 0.8972 | 0.8756 |
DADL | 0.8670 | 0.8425 |
IDL | 0.8625 | 0.8541 |
CCA | 0.8528 | 0.8478 |
AADF | 0.8981 | 0.8730 |
Method | UC Merced (Accuracy) | SIRI-WHU (Accuracy)
---|---|---
The proposed SSCA framework | 0.9314 | 0.9177 |
Without rotation-invariant HOG | 0.9119 | 0.9043 |
Without the NSCA method | 0.8933 | 0.8748 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).