1. Introduction
Synthetic aperture radar (SAR) is an active radar observation system known for its all-weather, day-and-night imaging capability and is widely used in military and civilian applications. In the military domain, a crucial task is to identify targets in SAR images, such as military vehicles, tanks, ships, and aircraft. However, due to its unique imaging mechanism, SAR images are more challenging to interpret than optical images. Manual interpretation of SAR images requires significant manpower and material resources and is therefore unsuitable for unmanned platforms such as aircraft, missiles, and satellites. This has motivated SAR automatic target recognition (ATR), which has become a hot research area in recent years, continuously evolving and achieving promising results.
The methods for SAR ATR can be primarily categorized into traditional approaches and deep learning-based methods. Traditional approaches predominantly adhere to a framework that combines handcrafted features with classifiers. Commonly used features include geometric features [1,2], transform domain features [3,4,5,6], and electromagnetic scattering features [7,8]. Among classifiers, support vector machines (SVMs) [9], k-nearest neighbor classifiers (k-NNs) [10], and sparse representation classifiers (SRCs) [11] are widely employed. With the advancement of deep learning technology, it has been extensively applied to the SAR field, such as in SAR target recognition [12,13], SAR target detection [14,15], and SAR data augmentation [16,17]. Compared to traditional methods, deep learning offers end-to-end implementation, automatic extraction of target features, and high accuracy, effectively enhancing system performance. Early research integrating various deep learning techniques into SAR ATR has achieved significant success [18,19].
Most of the aforementioned studies are based on the closed-set assumption, where the target types in the test set are included in the target types of the training set. This type of task is referred to as closed-set recognition (CSR). However, in real-world applications, the situation is often complex and dynamic, and new target categories may appear during the testing phase that were not present during the training phase. In such cases, the closed-set assumption no longer holds, leading to what is known as open-set recognition (OSR). SAR ATR methods based on the closed-set assumption often force unknown class targets to be classified as one of the known classes, resulting in unquantifiable errors and risks. A comprehensive SAR ATR system should be able to classify known classes while also effectively rejecting unknown classes.
Therefore, many scholars have focused their attention on the OSR of SAR targets in recent years. OSR methods in the SAR domain can generally be categorized into discriminative approaches and generative approaches. Discriminative approaches achieve OSR by designing similarity metrics that capitalize on the differences between known and unknown classes. Scherreik et al. [20] improved the SVM, proposing the W-SVM and POS-SVM methods and successfully applying them to the OSR of SAR targets. Dang et al. [21] utilized extreme value theory (EVT) to construct closed boundary models for known classes, thereby detecting unknown classes. Wang et al. [22] introduced an entropy-aware meta-learning method that operates at the feature-space level, significantly improving the OSR performance of the system. Ma et al. [23] proposed an OSR method based on the joint training of class-specific sub-dictionary learning, utilizing reconstruction error to identify unknown-class targets. Owing to the powerful feature extraction capabilities of deep learning, it has also been widely used in SAR OSR. For instance, several researchers have explored the OpenMax method in depth and applied it effectively to the OSR of SAR targets [24,25]. Giusti et al. [26] differentiated between known and unknown classes by leveraging the proportional similarity between different SAR image categories and setting thresholds. Thomson et al. [27] utilized the Regular Polytope Network (RPN) for SAR OSR, which enhances the separation of target features and benefits recognition performance. Inkawhich et al. [28] developed a training method named AdvOE, significantly enhancing model accuracy. Additionally, Ma et al. [29] divided OSR into two tasks, classification and anomaly detection, and implemented them with generative adversarial networks (GANs) for SAR images. Zhou et al. [30] explored SAR OSR under data-constrained conditions, utilizing a graph convolutional network (GCN) to construct distributions and identifying unknown classes through discriminative class similarity.
In contrast to discriminative approaches, generative approaches primarily address the OSR problem by generating instances of unknown classes and training on them alongside known classes, thereby transforming the open-set problem into a closed-set one. Cui et al. [31] combined a counterfactual framework with SAR OSR, employing counterfactual images generated from test samples to determine whether they belong to known or unknown classes. Geng et al. [32] proposed two innovative generative models, the spatial clipping generating (SCG) and weighting generating (WG) models, and demonstrated their excellent performance in known-class categorization and unknown-class detection through a series of experiments. Neal et al. [33] proposed a novel framework called OSRCI, which employs GANs to generate samples that closely resemble known classes but belong to none of them, effectively converting the OSR problem into a CSR problem with an additional class. Jo et al. [34] further investigated the use of GANs for synthetic data generation, with the objective of enhancing the robustness of classifiers when encountering unknown classes.
The aforementioned studies have achieved satisfactory outcomes and provide valuable references for future work. Summarizing these findings, SAR OSR primarily confronts several challenges. First, the diversity and complexity of SAR targets: when the perspective, attitude, and angle of SAR targets change, their imagery undergoes significant alterations, which traditional feature extraction methods may struggle to capture. Second, noise and interference: due to the unique imaging mechanism of SAR, images contain substantial speckle noise, which can affect conventional deep learning models and thereby degrade recognition accuracy. Third, limited data volume: high-quality, large-scale datasets are difficult to obtain because of the high cost of SAR imaging, and the scarcity of samples in existing SAR target datasets exacerbates the problem of insufficient training data, whereas deep learning methods often rely on large datasets. In view of these challenges, this paper employs the Capsule Network [35] as the backbone network for feature extraction, which offers the following advantages for SAR OSR. First, the Capsule Network effectively encodes the pose information of targets, making the system more adaptable to pose variations of SAR targets. Second, its structural design and dynamic routing mechanism provide stronger robustness against noise and interference in SAR targets, enabling more stable extraction of effective features. Third, the Capsule Network excels in small-sample learning and data-scarce settings, and its structure helps enhance the model's generalization capability.
Essentially, the key step in achieving OSR lies in accurately classifying known classes while simultaneously identifying a metric that can effectively differentiate between known and unknown classes. The accurate classification of known classes can be accomplished through the classification capabilities of the Capsule Network, while the discrimination of unknown classes necessitates the identification of an appropriate metric. Kullback–Leibler divergence (KLD) is employed to measure the degree of difference between two distributions. When the similarity between two distributions is high, the KLD value is small; conversely, when the similarity is low, the KLD value is large. For a specific target, the KLD value between it and the targets within its own class is relatively small, while the KLD value with targets from other classes is relatively large. Therefore, when a sample under examination belongs to a known class, it will exhibit a small KLD with a target from that known class. Conversely, when the sample belongs to an unknown class, it will demonstrate a large KLD with all known class targets. This criterion can thus be effectively utilized to reject unknown classes.
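As a toy numerical illustration of this criterion (the three-bin distributions below are invented for the example, not drawn from SAR data):

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

print(kld([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # similar distributions   -> ~0.03 (small)
print(kld([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))  # dissimilar distributions -> ~1.17 (large)
```

A test sample drawn from a known class behaves like the first pair; a sample from an unknown class behaves like the second pair against every known class.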
In view of the advantages of the KLD metric and the Capsule Network, this paper integrates the two to propose a novel OSR model for SAR targets. Moreover, because SAR targets exhibit small inter-class differences and large intra-class variations, unknown-class targets may lie close to known-class targets in feature space, making them difficult to distinguish. Consequently, this paper also designs a loss function that makes features within the same class more compact and features of different classes more dispersed. This strategy increases the feature separation between known and unknown classes, thereby enlarging the KLD between them and effectively enhancing the performance of the OSR model. The contributions of this paper are as follows.
- (1)
We propose a novel end-to-end OSR model based on the Capsule Network for SAR targets. The method combines the Capsule Network with KLD, utilizing the Capsule Network for feature extraction and calculating the KLD between test samples and known classes. It achieves the dual objectives of classifying known classes and effectively rejecting unknown classes, thereby addressing the OSR problem.
- (2)
A loss function tailored for OSR has been designed, incorporating margin loss, center loss, and KL loss, which reduces intra-class differences and increases inter-class differences in the extracted features. This leads to an enlarged KLD between known and unknown class features, thereby addressing the OSR problem more effectively.
- (3)
Tested on a real-world dataset, the proposed model achieves a recognition rate of over 95%. Compared with other traditional methods, it exhibits superior performance.
The remainder of this paper is organized as follows. Section 2 provides the overall framework of the proposed OSR method. Section 3 gives a detailed description of the feature extraction network proposed in this paper. Section 4 describes the KLD-based discriminant model in detail. Section 5 presents the experiments and test results. Finally, the discussion and conclusions are presented in Section 6 and Section 7, respectively.
3. Feature Extraction Based on the Capsule Network
3.1. The Architecture of the Capsule Network
In the method proposed in this paper, the Capsule Network is used for target feature extraction. In the processing of target imagery, traditional neural networks, such as Convolutional Neural Networks (CNNs), employ pooling operations to diminish the dimensions of feature maps. However, this reduction can lead to the disruption of spatial relationships among features, which may render the model insensitive to variations in the target’s pose, consequently degrading the model’s performance. The Capsule Network offers a solution to this issue by incorporating the concept of “capsules”, which preserve a greater amount of information regarding the target’s pose through their vectorized outputs. This approach ensures the retention of the target’s structural information and spatial relationships to a greater extent, thereby enhancing the model’s sensitivity and robustness to changes in the target’s pose. Furthermore, CNNs are contingent upon a fixed structure of convolutional kernels, which are incapable of dynamically adjusting to the complex deformations of targets. In contrast, the Capsule Network is equipped with a dynamic routing mechanism that adjusts the connection weights between features, thus providing a more effective adaptation to the intricate deformations of targets. This capability endows the Capsule Network with a superior generalization ability when confronted with targets from various viewpoints. In light of these considerations, we have selected the Capsule Network as the network for feature extraction of targets due to its capacity to maintain spatial hierarchies, its robustness to variations in target pose, and its enhanced adaptability to the complex deformations of targets.
The primary steps involved in data processing by the Capsule Network can be divided into the following stages: the input layer receives the data, the convolutional layers extract image features, the primary capsule layer (denoted as the PrimaryCaps layer) encapsulates these features into capsules, and the higher-level capsule layer (denoted as the DigitCaps layer) ultimately produces the output. The Capsule Network proposed in this paper comprises two convolutional layers, a PrimaryCaps layer, a DigitCaps layer, and a fully connected layer. Additionally, a multi-scale feature extraction module is introduced. The overall structure of the network and the information pertaining to each component are presented in Table 1.
Given an input image, the processing procedure can be described as follows. Assuming the input image has dimensions of 128 × 128, it is first fed into the shallow convolutional layers for local feature extraction; specifically, two convolutional layers are employed, each with a kernel size of 9 × 9 and a stride of 2. The resulting features are then input into the multi-scale feature extraction module to extract diverse features. To avoid altering the original structure of the Capsule Network, the parameters of the multi-scale feature extraction module are designed so that the dimensions and number of features remain unchanged between its input and output. Following this module, the PrimaryCaps layer encapsulates the features extracted by the preceding convolutional layers into a capsular representation, utilizing 11 × 11 kernels with a stride of 2; each capsule map produced at this stage has a spatial dimension of 8 × 8. The output of the PrimaryCaps layer is transmitted to the DigitCaps layer, where a dynamic routing mechanism iteratively updates the weight matrix, progressively enhancing the degree of alignment between the input and output capsules. The output of the DigitCaps layer comprises one capsule per input category, with the feature dimensionality set to 16. To leverage the features extracted by the Capsule Network, a fully connected layer is appended at the end of the network, transforming the output of the DigitCaps layer into a 128-dimensional feature vector. The processing workflow is illustrated in Figure 2.
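To make the data flow concrete, the following PyTorch sketch mirrors the shapes described above (128 × 128 input, two 9 × 9 stride-2 convolutions, an 11 × 11 stride-2 PrimaryCaps layer yielding an 8 × 8 grid of capsules, 16D class capsules, and a 128-dimensional output feature). The channel counts (256 convolutional filters and 32 primary-capsule types of 8 dimensions) are assumptions borrowed from the original CapsNet rather than the values of Table 1, and the multi-scale module is left as a placeholder (see Section 3.2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Capsule non-linearity: short vectors shrink toward 0, long ones toward unit length."""
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

class CapsNetFeatureExtractor(nn.Module):
    def __init__(self, num_classes=3, routing_iters=4):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 256, kernel_size=9, stride=2)    # 128x128 -> 60x60
        self.conv2 = nn.Conv2d(256, 256, kernel_size=9, stride=2)  # 60x60 -> 26x26
        self.multiscale = nn.Identity()  # placeholder for the module of Section 3.2
        self.primary = nn.Conv2d(256, 32 * 8, kernel_size=11, stride=2)  # 26x26 -> 8x8
        self.num_caps = 32 * 8 * 8  # 2048 primary capsules, 8D each
        # One 16x8 transformation matrix per (primary capsule, class capsule) pair
        self.W = nn.Parameter(0.01 * torch.randn(1, self.num_caps, num_classes, 16, 8))
        self.fc = nn.Linear(num_classes * 16, 128)  # final 128D feature vector
        self.routing_iters = routing_iters

    def forward(self, x):
        b_size = x.size(0)
        x = F.relu(self.conv2(F.relu(self.conv1(x))))
        x = self.multiscale(x)
        u = self.primary(x).view(b_size, 32, 8, 8, 8)        # (B, types, dim, H, W)
        u = squash(u.permute(0, 1, 3, 4, 2).reshape(b_size, self.num_caps, 8))
        # Prediction vectors u_hat = W u for every (primary, class) pair
        u_hat = (self.W @ u.unsqueeze(2).unsqueeze(-1)).squeeze(-1)  # (B, N, K, 16)
        b = torch.zeros(b_size, self.num_caps, u_hat.size(2), device=x.device)
        for _ in range(self.routing_iters):                   # dynamic routing by agreement
            c = F.softmax(b, dim=2)                           # coupling coefficients
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))  # (B, K, 16) class capsules
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)      # reinforce agreeing capsules
        lengths = v.norm(dim=-1)          # capsule lengths act as class scores
        features = self.fc(v.flatten(1))  # 128D feature fed to the KLD discriminator
        return lengths, features
```

In this sketch, the capsule lengths serve directly as class scores for the margin loss, while the 128-dimensional feature vector is what the KLD-based discriminant model of Section 4 consumes.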
3.2. Multi-Scale Feature Extraction Module
The multi-scale feature extraction module is introduced in this article to enhance the robustness of the system. In the context of image processing using deep neural networks, the salient parts of an image may exhibit significant variations in size, posing challenges in selecting the appropriate convolution kernel size for convolution operations. The multi-scale feature extraction module integrates convolution kernels of different sizes and processes them in parallel, enabling the network to simultaneously capture features at various scales. Smaller convolution kernels can capture fine details, while larger kernels can capture more global contextual information. This module allows the network to observe the input data from different perspectives and scales, thus generating richer and more diverse feature representations. Such diverse feature representations contribute to improving the generalization capability and robustness of the model.
The multi-scale feature extraction module is positioned between the shallow convolutional layer and the primary capsule layer in this study, comprising a total of four parallel convolutional modules. The information pertaining to each convolutional module is presented in Table 2.
The 1 × 1 and 3 × 3 kernel filters are small convolutional filters used to capture the fine detail features within SAR images, while the 5 × 5 and 7 × 7 kernel filters are larger convolutional filters designed to capture the global features of SAR images. Following the four parallel convolutional modules, the outputs are concatenated along the channel dimension, resulting in a combined feature set. This operation allows the network proposed in the paper to retain detailed features while also extracting more abstract and higher-level information, thereby enhancing the model’s representational capacity. It is important to note that this work omits the pooling layers typically found in multi-scale feature extraction networks to prevent the loss of detail information when processing SAR images, which could adversely affect the model’s performance. This approach aligns with the concepts underpinning the Capsule Network.
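A minimal sketch of such a module is given below, assuming the input channels are split evenly across the four branches and "same" padding is used, so that the number and dimensions of features are preserved; the actual per-branch settings are those listed in Table 2:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Four parallel convolutions whose outputs are concatenated on the channel axis."""
    def __init__(self, channels=256):
        super().__init__()
        branch = channels // 4  # even split so the concatenated output matches the input width
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, branch, kernel_size=k, padding=k // 2)  # 'same' padding
            for k in (1, 3, 5, 7)  # fine detail (1x1, 3x3) to global context (5x5, 7x7)
        ])

    def forward(self, x):
        # Each branch observes the same input at a different receptive field;
        # spatial size and total channel count are preserved, and no pooling is used.
        return torch.cat([b(x) for b in self.branches], dim=1)
```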
3.3. The Loss Function
For classification tasks, the Capsule Network typically employs the margin loss as the main loss function to guide training. Margin loss is designed to sharpen the boundaries between categories: it penalizes spread among samples of the same category while encouraging distance between samples of different categories, so that training increases inter-class variance and improves the model's classification performance and generalization ability. Its design incorporates a margin parameter that governs the gap between intra-class and inter-class distances; by adjusting this parameter, samples within the same category are made more compact, while those from different categories become more dispersed, thereby enhancing inter-class discrimination.
Specifically pertaining to the Capsule Network proposed in this paper, the calculation of the margin loss can be described as follows. The loss for an input sample is computed for each class individually and then aggregated, ensuring that the length of the capsule output vector for the correct category exceeds a threshold $m^+$, while the lengths for incorrect categories remain below a threshold $m^-$. This bolsters the model's confidence in the correct categories while diminishing its assurance in erroneous ones. For a given sample $x_i$, the margin loss $L_k$ for category $k$ can be articulated as:

$$L_k = T_k \max\left(0, m^+ - \lVert \mathbf{v}_k \rVert\right)^2 + \lambda \left(1 - T_k\right) \max\left(0, \lVert \mathbf{v}_k \rVert - m^-\right)^2$$

where $T_k$ serves as an indicator, taking the value of 1 if category $k$ is the correct category and 0 otherwise; $\lVert \mathbf{v}_k \rVert$ denotes the length of the capsule output vector of sample $x_i$ for category $k$; and $\lambda$ represents a scaling parameter utilized to balance the contributions of the positive- and negative-class losses, thereby preventing a disproportionately small contribution of the positive-class loss to the total loss during the early stages of training.
Assuming the training samples encompass $K$ categories in total, the margin loss for the sample $x_i$ can be expressed as:

$$L_{margin} = \sum_{k=1}^{K} L_k$$
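As a sketch, the batch-level margin loss can be computed directly from the capsule lengths; the values $m^+ = 0.9$, $m^- = 0.1$, and $\lambda = 0.5$ are the common CapsNet defaults, assumed here rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def margin_loss(lengths, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """lengths: (B, K) capsule output lengths; labels: (B,) integer class indices."""
    T = F.one_hot(labels, num_classes=lengths.size(1)).float()
    positive = T * F.relu(m_pos - lengths) ** 2               # correct class: push length above m+
    negative = lam * (1 - T) * F.relu(lengths - m_neg) ** 2   # wrong classes: push lengths below m-
    return (positive + negative).sum(dim=1).mean()            # sum over classes, mean over batch
```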
To reduce intra-class variation in the extracted features, the center loss is introduced. The center loss measures the distance between sample features and the centroid of their respective class; by incorporating it, features within the same class can be made more compact. Given the sample $x_i$, the center loss can be represented as:

$$L_{center} = \frac{1}{2} \left\lVert f(x_i) - c_{y_i} \right\rVert_2^2$$

where $c_{y_i}$ is the center of the category $y_i$ to which sample $x_i$ belongs; $f(x_i)$ denotes the feature of sample $x_i$ extracted by the Capsule Network; and $\lVert \cdot \rVert_2$ represents the Euclidean norm (also known as the L2 norm).
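A minimal sketch follows; maintaining the class centers as learnable parameters updated by backpropagation is one common implementation choice and is our assumption here:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Squared L2 distance between each feature and its own class centroid."""
    def __init__(self, num_classes, feat_dim=128):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        diff = features - self.centers[labels]   # centroid of each sample's own class
        return 0.5 * (diff ** 2).sum(dim=1).mean()
```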
Furthermore, during the training phase, a KLD constraint term is introduced so that the KLD within the same class is minimized while the KLD between different classes is maximized. Specifically, a regularization term, referred to as the KL loss, is added to the original loss function; it penalizes excessive KLD between samples of the same class and rewards increased KLD between samples of different classes.
Due to the requirement of calculating the KLD between different samples, it is not feasible to compute the loss for an individual sample independently. Consequently, a batch is considered as a single processing unit, where the KLDs between all samples within the batch are computed simultaneously. The average of these divergences is then taken to obtain the KL loss for the batch. The specific computational procedure is as follows:
Within a given batch, all sample pairs are exhaustively traversed, with the number of intra-class sample pairs denoted by $N_{intra}$ and the number of inter-class sample pairs denoted by $N_{inter}$. The KLD is first computed for all sample pairs; the aggregate KLDs of the intra-class and inter-class pairs are then averaged over their respective sets:

$$D_{intra} = \frac{1}{N_{intra}} \sum_{(i,j) \in S_{intra}} D_{KL}\left(p_i \parallel p_j\right), \qquad D_{inter} = \frac{1}{N_{inter}} \sum_{(i,j) \in S_{inter}} D_{KL}\left(p_i \parallel p_j\right)$$

where $D_{KL}(p_i \parallel p_j)$ represents the KLD of $p_i$ from $p_j$, and $S_{intra}$ and $S_{inter}$ denote the sets of intra-class and inter-class sample pairs, respectively. The KL loss can then be expressed as:

$$L_{KL} = D_{intra} - \beta\, D_{inter}$$

where $\beta$ is a balancing factor that controls the influence of the KLD constraint term.
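The batch-level computation can be sketched as follows; treating the softmax-normalized feature vectors as the distributions $p_i$ is our assumption, since the construction of the distributions is not fixed above:

```python
import torch
import torch.nn.functional as F

def kl_loss(features, labels, beta=1.0, eps=1e-8):
    """Average intra-class KLD minus beta times average inter-class KLD."""
    p = F.softmax(features, dim=1)              # assumption: features -> distributions
    logp = torch.log(p + eps)
    # kld[i, j] = D_KL(p_i || p_j) for every ordered pair in the batch
    kld = (p.unsqueeze(1) * (logp.unsqueeze(1) - logp.unsqueeze(0))).sum(dim=-1)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=features.device)
    intra = kld[same & off_diag].mean()         # small when classes are compact
    inter = kld[~same].mean()                   # large when classes are separated
    return intra - beta * inter
```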
The final loss function for a batch can be expressed as:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left( \lambda_1 L_{margin} + \lambda_2 L_{center} \right) + \lambda_3 L_{KL}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the coefficients corresponding to the three loss functions, respectively, and $N$ denotes the number of samples in a batch.
With the designed loss function, the model is capable of maintaining high-precision classification for known classes while simultaneously achieving reduced intra-class variance and increased inter-class variance among the extracted features of known classes. Consequently, this facilitates the distinction of unknown classes from known classes with greater ease.
5. Experiments and Results
In this section, multiple experiments were conducted on the SAR-ACD dataset [36] to validate the effectiveness of the proposed method. First, the SAR-ACD dataset used in the experiments is introduced. Then, with three fixed known classes, experiments were conducted with varying numbers of unknown classes (1, 2, and 3) to validate the effectiveness of the model. Finally, to assess the advancement and robustness of the model, experiments were performed under three scenarios in which the number of known classes ranged from 5 down to 3 while the number of unknown classes ranged from 1 to 3, and the results were compared with those of several other methods.
5.1. Experiment Setup
5.1.1. Dataset
In this paper, we conduct experiments using the SAR-ACD dataset, a challenging SAR aircraft category dataset characterized by high scene complexity. The dataset has a resolution of 1 m and includes six types of aircraft targets: A220, A320/321, A330, ARJ21, Boeing737, and Boeing787, comprising a total of 3032 images with approximately 500 images per category. The optical images and corresponding SAR images of the six target classes in the SAR-ACD dataset are shown in Figure 3.
To facilitate training and testing, the images of each type were randomly divided into training and testing sets at a ratio of 6:4. The number of targets in each category and the sizes of the training and test sets after division are shown in Table 3.
Since the dataset images are of different sizes, they are uniformly resized to 128 × 128 before inputting into the network for the convenience of training and testing.
5.1.2. Metrics
To measure the openness of the OSR problem, this paper introduces the metric Openness [37], which is calculated using the following formula:

$$Openness = 1 - \sqrt{\frac{2\, C_{train}}{C_{train} + C_{test}}}$$

where $C_{train}$ represents the number of target classes in the training set, and $C_{test}$ denotes the number of target classes in the test set. The larger the value of Openness, the greater the proportion of unknown classes in the OSR task.
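In code, with the three-known/three-unknown setting of Section 5.2 as an example:

```python
def openness(train_classes: int, test_classes: int) -> float:
    """Openness as defined above; 0 corresponds to closed-set recognition."""
    return 1.0 - (2.0 * train_classes / (train_classes + test_classes)) ** 0.5

print(round(openness(3, 6), 3))  # 3 known classes, 3 additional unknown -> 0.184
```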
To quantitatively assess the performance of the proposed OSR method in the SAR ATR task, four evaluation metrics are employed: accuracy, precision, recall, and F1-score. They are calculated as:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

In the aforementioned equations, TP stands for True Positives, the number of instances correctly predicted as positive by the model; TN denotes True Negatives, the number of instances correctly predicted as negative; FP signifies False Positives, the number of instances incorrectly predicted as positive; and FN represents False Negatives, the number of instances incorrectly predicted as negative. Additionally, the confusion matrix is utilized to render the results more intuitive.
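The same formulas transcribed directly into code (assuming non-zero denominators):

```python
def osr_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```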
The experiments were implemented on a laptop with an Intel Core i5-13500H CPU, an NVIDIA GeForce RTX 3050 4 GB Laptop GPU, and 16 GB of RAM, running Windows 11. In the experiments, the learning rate is set to 0.00003, the batch size to 16, the number of epochs to 200, and the number of iterations of the dynamic routing mechanism to 4.
5.2. Three-Class Open-Set Recognition
To evaluate the effectiveness of the model proposed in this paper, we conducted experiments under three different scenarios. The known categories (denoted as kn) were fixed at three types: A320/321, ARJ21, and Boeing737. The unknown categories (denoted as un) were set as one category (A330), two categories (A330, Boeing787), and three categories (A220, A330, Boeing787), respectively.
During the training phase, three known categories are input into the Capsule Network, resulting in a trained model along with the corresponding features for each of the three known categories. For each category of features, the KLD between the feature of each sample and the centroid of the sample features for that class is computed. The greatest KLD value across all samples is established as the KLD threshold for that class. Subsequently, the largest threshold among the three category thresholds is selected as the threshold for known classes.
In the test phase, the sample is input into the preserved model, which outputs the predicted category along with its associated features. The KLD between the features of the test sample and the centroid of the features of each known class is calculated. If the divergence value exceeds the threshold, the sample is determined to be from an unknown class; otherwise, the predicted category output by the network is used as the prediction label. The results obtained are presented in Table 4, and the confusion matrices of the three scenarios are shown in Figure 4.
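The threshold calibration and rejection rule described above can be sketched as follows; `kld` is a hypothetical helper returning the KLD between a sample's feature distribution and a class centroid, and comparing the minimum class-wise divergence against the threshold is our reading of the rejection rule:

```python
def calibrate_threshold(train_feats, train_labels, centroids, kld):
    # Per class: the largest KLD from any training sample to its own centroid;
    # the known-class threshold is the largest of these per-class maxima.
    per_class_max = [
        max(kld(f, centroids[c]) for f, y in zip(train_feats, train_labels) if y == c)
        for c in centroids
    ]
    return max(per_class_max)

def predict_open_set(feat, predicted_label, centroids, threshold, kld, unknown=-1):
    # Reject as unknown only if the sample is far (in KLD) from every known class.
    if min(kld(feat, centroids[c]) for c in centroids) > threshold:
        return unknown
    return predicted_label
```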
The results in Table 4 and Figure 4 indicate that the method proposed in this paper achieves a classification accuracy of over 98% for known categories while also attaining a high recognition rate for unknown categories, with an F1-score maintained above 95%. This demonstrates the effectiveness of the proposed method in accomplishing the OSR task.
5.3. Multiclass Open-Set Recognition
To validate the robustness of the proposed method, three scenarios were established with varying compositions of datasets, where the known categories were set as 5, 4, and 3 classes, respectively, and the corresponding unknown categories were set as 1, 2, and 3 classes. To demonstrate the advancement of this method, it was compared with several other methods, including MGPL [38], OpenMax [39], CROSR [40], CAVECapOSR [41], and CAC [42]. The results obtained are presented in Table 5.
From the results, it can be observed that the method proposed in this paper exhibits significant improvements over the other methods in terms of accuracy, precision, recall, and F1-score. In particular, as the number of unknown targets increases, our method still maintains a high level of performance, while that of the other methods drops dramatically.
In addition, we compare the above methods with the method in this paper in terms of computational complexity in the third scenario. Computational complexity can be divided into time complexity and space complexity. For the convenience of quantitative analysis, we use the number of floating-point operations (FLOPs) as the indicator of time complexity, and the number and size of parameters as the indicators of space complexity [43]. The obtained results are shown in Table 6.
From the results, it can be seen that our method has the smallest FLOPs, that is, the lowest time complexity. This is because, compared with the deep networks used in the other methods, the network in this paper has relatively few layers, so its FLOPs are smaller. In terms of the number and size of parameters, however, although our network is shallower, its parameter count and size are roughly ten times those of the CAC method and are of the same order of magnitude as those of the other methods. This is mainly because the dynamic routing mechanism requires iteration, and this mechanism effectively improves the performance of the model.