1. Introduction
With the expansion of human space exploration, the number of micromotion space targets, such as satellites, warheads, and space debris, is increasing. Radar is the main sensor used for space target detection, tracking, and recognition. In the field of radar automatic target recognition, extracting target information (such as shape, structure, and micromotion characteristics) from radar echoes for space target recognition has attracted widespread attention [1,2,3,4].
The existing methods for space target recognition can be categorized as model-driven methods [2,5,6] and data-driven methods [7,8,9]. Model-driven methods require handcrafted features, leading to high complexity and weak generalizability [10]. Data-driven methods based on deep learning do not require manually designed features and can automatically extract discriminative target features from the data. Researchers have begun using deep learning methods to recognize space targets, achieving higher recognition accuracy than model-driven methods. However, these deep learning methods can recognize only targets belonging to the classes seen during training. When unseen classes of space targets appear during the testing phase, these methods become ineffective, thereby limiting the practical applications of the algorithms. In the real world, encountering targets during the testing phase that were not seen during the training phase is likely, especially for non-cooperative targets whose samples cannot be obtained in advance. Therefore, methods for recognizing unseen classes that appear during the testing phase must be studied.
Inspired by human learning of new concepts, zero-shot learning (ZSL) techniques have attracted increasing interest [11]. Deep learning-based supervised classification methods can recognize only targets belonging to the seen classes during the training phase. ZSL recognizes unseen classes in the testing set by transferring semantic knowledge from seen to unseen classes, for which no training samples are available. To our knowledge, no space target recognition methods based on ZSL currently exist.
In the computer vision field, existing ZSL methods for target recognition can be categorized into three main types: early embedding-based methods [12,13,14,15], generative-based methods [16,17,18,19], and part embedding-based methods [20,21,22,23]. Early embedding-based methods learn mapping spaces between the visual features of seen classes and their semantic descriptions. They recognize unseen classes by mapping them into these spaces and conducting nearest-neighbor searches. One major issue with these methods is the bias problem, whereby test samples tend to be classified into seen classes [24]. To alleviate this issue, generative-based methods utilize generative adversarial networks (GANs) to generate features for unseen classes, thereby transforming ZSL into a fully supervised classification problem. Although these methods have exhibited certain improvements in unseen class recognition accuracy, they consider only global visual features and cannot capture the local regions in images that correspond to semantic descriptions, limiting the transfer of semantic knowledge. Recently, some part embedding-based methods have achieved higher recognition accuracy by incorporating attention mechanisms or by obtaining discriminative visual features from local image regions guided by semantic attributes. However, we believe that both global and local visual features play important roles in ZSL. Additionally, due to the lack of semantic attributes specifically defined for space targets, it is difficult to directly apply existing methods to the ZSL of space targets. Therefore, we attempt to design binary semantic attributes for space targets and propose a novel ZSL method that simultaneously considers global and local visual features, thereby improving the ZSL capability for space targets.
In this article, an end-to-end framework for generalized zero-shot learning (GZSL)-based space target recognition, termed the Global-Local Visual Feature Embedding Network (GLVFENet), is proposed. This framework simultaneously recognizes both seen and unseen space targets during the testing phase. First, we devise prior binary semantic attributes for each space target category, which is the key to ZSL. Next, we introduce a CNN backbone network that extracts initial features from raw high-resolution range profile (HRRP) sequences; its output is passed to two subsequent subnetworks. Then, we develop a global visual feature embedding subnetwork (GVFE-Subnet) that further captures global visual features from the output of the CNN backbone network and maps these features to the semantic space. By using cosine similarity to calculate the compatibility scores with the ground-truth semantic attributes, we obtain the global visual embeddings. Because the discriminative visual properties in the images are the key to transferring knowledge from seen classes to unseen classes, we also design a local visual feature embedding subnetwork (LVFE-Subnet). LVFE-Subnet utilizes soft spatial attention and an encoder to localize discriminative local regions in the images through semantic knowledge, obtaining local visual embeddings. Finally, we optimize the full network by using a joint loss function and consider both global and local visual embeddings to achieve GZSL recognition of space targets. The contributions of this article are as follows:
(1) A framework for GZSL of space targets is proposed for recognizing targets of both seen and unseen classes. The framework consists primarily of a GVFE-Subnet and an LVFE-Subnet, and the dual-branch network improves the model’s ability to transfer knowledge from seen classes to unseen classes.
(2) To our knowledge, this is the first attempt to utilize ZSL techniques for space target recognition. To achieve ZSL, we incorporate expert prior knowledge in the space target field and design binary semantic attributes to describe the characteristics of different categories of space targets, forming the basis for inferring unseen classes.
(3) The results of comparative and ablation experiments based on electromagnetic (EM) calculation data demonstrate the effectiveness of the proposed method.
The remainder of this article is organized as follows: Section 2 introduces related work on space target recognition and ZSL. Section 3 provides a detailed description of the proposed GLVFENet. Section 4 describes the dataset generation, evaluation metrics, and implementation details and presents the experimental results. Section 5 presents the discussion. Finally, Section 6 concludes the article.
3. Proposed Method
In this section, a formal definition of the task is first provided, then the semantic information designed for space target recognition is introduced, and finally a detailed description of the proposed method is presented.
3.1. Problem Formulation
Let $\mathcal{D}^{s}=\{(\mathbf{X}_{i}^{s}, y_{i}^{s}, \mathbf{a}^{s})\}_{i=1}^{N^{s}}$ be the seen class set, where $\mathbf{X}_{i}^{s} \in \mathbb{R}^{T \times L}$ denotes an HRRP sequence with duration $T$ and $L$ range cells in a single HRRP, $y_{i}^{s}$ denotes the label corresponding to $\mathbf{X}_{i}^{s}$, $\mathcal{Y}^{s}$ denotes the set of labels of the seen classes, $C^{s}$ denotes the number of seen classes, $\mathbf{a}^{s} \in \mathbb{R}^{M}$ denotes a semantic vector of seen classes with $M$ dimensions, and $N^{s}$ denotes the number of samples of seen classes. Similarly, the dataset of unseen classes can be denoted as $\mathcal{D}^{u}=\{(\mathbf{X}_{i}^{u}, y_{i}^{u}, \mathbf{a}^{u})\}_{i=1}^{N^{u}}$, where the HRRP sequence $\mathbf{X}_{i}^{u}$ is available only in the testing phase, $y_{i}^{u}$ denotes the label corresponding to $\mathbf{X}_{i}^{u}$, $\mathcal{Y}^{u}$ denotes the set of labels of the unseen classes, $C^{u}$ denotes the number of unseen classes, and $\mathbf{a}^{u}$ denotes an unseen class semantic vector. $\mathcal{X}=\mathcal{X}^{s} \cup \mathcal{X}^{u}$ denotes the sample space consisting of all seen and unseen classes, $\mathcal{Y}=\mathcal{Y}^{s} \cup \mathcal{Y}^{u}$ denotes the label space, with $\mathcal{Y}^{s} \cap \mathcal{Y}^{u}=\varnothing$, and $\mathcal{A}=\mathcal{A}^{s} \cup \mathcal{A}^{u}$ denotes the semantic space. In the GZSL task, all the training classes come from $\mathcal{D}^{s}$, and the test dataset can be denoted as $\mathcal{D}^{test}=\{(\mathbf{X}_{i}, y_{i})\}_{i=1}^{N^{test}}$, where $N^{test}$ denotes the number of samples in the test set. The goal is to train a model to recognize samples from $\mathcal{Y}^{s} \cup \mathcal{Y}^{u}$.
3.2. Semantic Representation
Semantic information is crucial for ZSL. Since samples from unseen classes are unavailable during the training phase, it is necessary to establish a relationship between seen and unseen classes through semantic information. However, to our knowledge, no semantic information specifically designed for describing space targets currently exists. In this section, we attempt to design semantic information for the ZSL of space targets based on the HRRP sequence. The designed semantic information consists of multiple binary semantic attributes and adheres to the following design principles: (1) Shareability: each attribute can be shared by multiple categories, and the semantic space formed by all attributes can be shared by both seen and unseen classes. (2) Distinguishability: the semantics of each category, formed by its attributes, must be distinct, providing accurate guidance for the network. (3) Interpretability: the designed attributes should reflect discriminative visual features and be interpretable or understandable.
Following these semantic design principles, we define binary semantic attributes for space targets. As the HRRP sequence contains both the geometric and micro-Doppler characteristics of the targets, we mainly focus on these two aspects when defining the binary semantic attributes. Specifically, we define six obvious appearance attributes, namely, two trajectories, three trajectories, four trajectories, sine curve, amplitude size, and coupling degree. These attributes can be directly observed from the HRRP sequences. However, they are not sufficient for distinguishing all nine categories of targets. We thus also add three derived semantic attributes, namely, double rotational symmetry, nutation, and wobble. Although these attributes cannot be directly obtained from the images, they help distinguish the targets. Finally, we group these nine binary semantic attributes into geometric features and micro-Doppler features, as shown in Table 1. The value “1” indicates that a class of targets has the corresponding attribute, while “0” indicates that it does not.
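To make the attribute design concrete, the following sketch shows how such a binary attribute table can be encoded as a class-by-attribute matrix and used as the semantic vectors of each class; the 0/1 assignments in the sketch are illustrative placeholders rather than the actual values in Table 1.

```python
import numpy as np

# Nine binary semantic attributes grouped into geometric and micro-Doppler features.
ATTRIBUTES = [
    "two_trajectories", "three_trajectories", "four_trajectories",
    "sine_curve", "amplitude_size", "coupling_degree",
    "double_rotational_symmetry", "nutation", "wobble",
]

# One row per target class, one column per attribute; the 0/1 values below are
# illustrative placeholders, not the actual assignments in Table 1.
semantic_matrix = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    # ... remaining classes omitted for brevity
], dtype=np.float32)

# Each class is then represented by its attribute (semantic) vector, and seen and
# unseen classes share the same nine-dimensional attribute space.
class_semantics = {f"class_{k}": semantic_matrix[k] for k in range(len(semantic_matrix))}
```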
3.3. GLVFENet
As shown in Figure 1, GLVFENet consists of three parts: a shared feature extraction module, the GVFE-Subnet, and the LVFE-Subnet. The shared feature extraction module extracts initial features from the raw HRRP sequences, and its outputs are then fed into the two subsequent subnetworks. The GVFE-Subnet uses cosine similarity to calculate the compatibility scores between the projection of the global visual features in the semantic space and the semantic vectors of each class. Under the supervision of the loss function, the GVFE-Subnet learns the global visual embeddings in the image. Since the visual features learned by the GVFE-Subnet are coarse-grained, distinguishing between seen and unseen classes with subtle differences in the global feature space is difficult. Therefore, we design an LVFE-Subnet, which utilizes soft spatial attention and an encoder to obtain semantically guided local visual embeddings. Finally, we summarize the end-to-end overall loss function and provide a prediction method for sample labels.
3.3.1. Shared Feature Extraction Module
The CNN is a commonly used backbone network in visual tasks [48] and is known for its strong feature representation ability. To this end, a CNN is used as the feature extraction network at the beginning of our network to capture features from the raw HRRP sequence. Let the $i$-th HRRP sequence sample fed into the network be denoted as $\mathbf{X}_{i} \in \mathbb{R}^{T \times L}$, where $T$ denotes the duration of the HRRP sequence and $L$ denotes the number of range cells of a single HRRP. The HRRP sequence representation obtained after the CNN operation is denoted as $\mathbf{F}_{i} = f_{\mathrm{CNN}}(\mathbf{X}_{i}) \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width of the feature map, respectively, and $f_{\mathrm{CNN}}(\cdot)$ denotes the convolution operation. Next, $\mathbf{F}_{i}$ is fed into the GVFE-Subnet and LVFE-Subnet.
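As an illustration, a minimal PyTorch sketch of such a shared feature extractor is given below; the layer configuration, channel counts, and input size ($T = 128$, $L = 256$) are assumptions for illustration, not the exact backbone used in GLVFENet.

```python
import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    """Minimal CNN backbone sketch: maps a raw HRRP sequence (1 x T x L)
    to a feature map F of shape (C x H x W). Layer sizes are illustrative."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1), nn.BatchNorm2d(out_channels), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T, L) -> F: (batch, C, H, W)
        return self.features(x)

# Example: a batch of 4 HRRP sequences with T = 128 time steps and L = 256 range cells.
hrrp = torch.randn(4, 1, 128, 256)
feature_map = SharedFeatureExtractor()(hrrp)   # shape (4, 64, 32, 64)
```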
3.3.2. Global Visual Feature Embedding Subnet
The global visual feature embedding subnet (GVFE-Subnet) is employed to further capture global visual features from the HRRP sequence representation and map these features to the semantic space. Then, in the cosine metric space, the compatibility scores between the visual feature embeddings and the semantic vectors are calculated, and coarse-grained global visual embeddings are obtained under the supervision of the loss function.
First, a global average pooling operation is performed on the input $\mathbf{F}_{i}$ of this subnetwork. The result $\mathbf{g}_{i} \in \mathbb{R}^{C}$ can be expressed as follows:
$$\mathbf{g}_{i} = f_{\mathrm{GAP}}(\mathbf{F}_{i}),$$
where $f_{\mathrm{GAP}}(\cdot)$ represents the average pooling operation.
The compatibility-based approach is widely used in ZSL tasks. In this approach, a compatibility function that measures the degree of compatibility between the image representation and the semantic vector of a sample is learned. The sample is labeled as the class corresponding to the semantic vector with the highest compatibility score with the image representation. Specifically, a learnable weight $\mathbf{W}_{g} \in \mathbb{R}^{M \times C}$ is first applied to $\mathbf{g}_{i}$ to project the global visual features into the semantic space. Then, the compatibility score between the visual projection $\mathbf{W}_{g}\mathbf{g}_{i}$ and each semantic vector $\mathbf{a}^{c}$ is computed as follows:
$$S(\mathbf{X}_{i}, c) = F(\mathbf{W}_{g}\mathbf{g}_{i}, \mathbf{a}^{c}),$$
where $F(\cdot,\cdot)$ represents the compatibility function.
Cosine similarity is commonly used to calculate the similarity of text embeddings, which helps limit and reduce the variance of neurons, resulting in models with better generalizability [49]. We use cosine similarity to predict which class $\mathbf{X}_{i}$ belongs to. In this way, the output of cosine similarity is the cosine value of the compatibility score between the visual embedding $\mathbf{W}_{g}\mathbf{g}_{i}$ and each semantic vector $\mathbf{a}^{c}$, i.e., $F(\mathbf{W}_{g}\mathbf{g}_{i}, \mathbf{a}^{c}) = \cos(\mathbf{W}_{g}\mathbf{g}_{i}, \mathbf{a}^{c})$. Therefore, the probability of label $y_{i}$ for sample $\mathbf{X}_{i}$ can be expressed as follows:
$$p(y_{i} \mid \mathbf{X}_{i}) = \frac{\exp\left(\sigma \cos\left(\mathbf{W}_{g}\mathbf{g}_{i}, \mathbf{a}^{y_{i}}\right)\right)}{\sum_{c=1}^{C^{s}} \exp\left(\sigma \cos\left(\mathbf{W}_{g}\mathbf{g}_{i}, \mathbf{a}^{c}\right)\right)},$$
where $\sigma$ represents the scaling factor.
Finally, a loss function based on cross-entropy is used to encourage the visual projection of the input samples to have the highest compatibility score with their corresponding semantic vectors. This function is defined as follows:
$$\mathcal{L}_{\mathrm{GVFE}} = -\frac{1}{N_{b}}\sum_{i=1}^{N_{b}} \log p\left(y_{i} \mid \mathbf{X}_{i}\right),$$
where $N_{b}$ represents the number of samples in a mini-batch.
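The following PyTorch sketch illustrates this global branch: global average pooling, a learnable projection into the semantic space, scaled cosine compatibility scores against the class semantic vectors, and a cross-entropy loss over those scores. The module name, dimensions (matching the backbone sketch above), and the scale value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GVFESubnet(nn.Module):
    """Sketch of the global branch: GAP -> learnable projection into the
    M-dimensional semantic space -> scaled cosine compatibility scores."""
    def __init__(self, in_channels: int = 64, semantic_dim: int = 9, scale: float = 20.0):
        super().__init__()
        self.proj = nn.Linear(in_channels, semantic_dim, bias=False)  # learnable weight W_g
        self.scale = scale                                            # scaling factor sigma

    def forward(self, feature_map: torch.Tensor, class_semantics: torch.Tensor):
        g = feature_map.mean(dim=(2, 3))             # global average pooling, (batch, C)
        emb = self.proj(g)                           # global visual embedding, (batch, M)
        # Cosine compatibility between each embedding and each class semantic vector.
        logits = self.scale * F.cosine_similarity(
            emb.unsqueeze(1), class_semantics.unsqueeze(0), dim=-1)   # (batch, num_classes)
        return emb, logits

# Cross-entropy over the compatibility scores pushes each sample toward its own class semantics.
subnet = GVFESubnet()
feature_map = torch.randn(4, 64, 32, 64)
class_semantics = torch.rand(9, 9)   # placeholder for the binary attribute matrix of seen classes
labels = torch.tensor([0, 3, 5, 8])
_, logits = subnet(feature_map, class_semantics)
loss_gvfe = F.cross_entropy(logits, labels)
```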
3.3.3. Local Visual Feature Embedding Subnet
The GVFE-Subnet captures the overall global visual features of the targets. However, when subtle differences in the global feature space exist between seen and unseen classes, relying solely on the global space may make distinguishing these subtle differences difficult. The local regions in an image can be abstracted into semantic attributes at high network levels, and the local regions corresponding to semantic attributes have stronger discriminative ability [50,51]. Therefore, we design a local visual feature embedding subnet (LVFE-Subnet) based on soft spatial attention [20] and a visual-semantic embedding encoder to obtain fine-grained local visual embeddings in the image guided by semantics.
(1) Soft spatial attention: Soft spatial attention is used to map the output $\mathbf{F}_{i}$ of the shared feature extraction module into $K$ attention feature maps. A 1 × 1 convolution is first applied to $\mathbf{F}_{i}$, and $K$ attention mask maps $\mathbf{M}_{i} \in \mathbb{R}^{K \times H \times W}$ are generated via a sigmoid activation function:
$$\mathbf{M}_{i} = \delta\left(f_{1\times1}(\mathbf{F}_{i})\right),$$
where $f_{1\times1}(\cdot)$ and $\delta(\cdot)$ represent the 1 × 1 convolution and sigmoid activation function operations, respectively, and the $k$-th mask map $\mathbf{M}_{i}^{k} \in \mathbb{R}^{H \times W}$ can be sliced from $\mathbf{M}_{i}$.
Then, we expand $\mathbf{M}_{i}^{k}$ to the same dimensions as $\mathbf{F}_{i}$ and perform channel-by-element multiplication with $\mathbf{F}_{i}$ to obtain the $K$ attention feature maps $\mathbf{A}_{i}^{k}$ corresponding to the $K$ attention mask maps. $\mathbf{A}_{i}^{k}$ can be computed by the following equation:
$$\mathbf{A}_{i}^{k} = \mathrm{expand}\left(\mathbf{M}_{i}^{k}\right) \odot \mathbf{F}_{i},$$
where $\mathrm{expand}(\cdot)$ represents the expansion of $\mathbf{M}_{i}^{k}$ to the same size as $\mathbf{F}_{i}$ and $\odot$ represents element-by-element multiplication.
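A minimal sketch of this attention step is shown below; setting the number of masks $K$ to nine (the number of attributes) and the channel count of 64 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Sketch of the attention step: a 1x1 convolution plus sigmoid produces K
    attention mask maps, which are expanded across channels and multiplied
    element-wise with the shared feature map."""
    def __init__(self, in_channels: int = 64, num_masks: int = 9):
        super().__init__()
        self.mask_conv = nn.Conv2d(in_channels, num_masks, kernel_size=1)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (batch, C, H, W)
        masks = torch.sigmoid(self.mask_conv(feature_map))        # (batch, K, H, W)
        # Expand each mask across the C channels and multiply element by element:
        # the result is K attention feature maps per sample, (batch, K, C, H, W).
        attended = masks.unsqueeze(2) * feature_map.unsqueeze(1)
        return attended

attention = SoftSpatialAttention()
attn_maps = attention(torch.randn(4, 64, 32, 64))   # shape (4, 9, 64, 32, 64)
```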
(2) Visual-semantic embedding encoder: The visual-semantic embedding encoder is used to encode the $K$ attention feature maps into local visual embeddings. This encoder consists primarily of a convolution-pooling structure and a linear transformation layer. First, the attention feature maps $\mathbf{A}_{i}^{k}$ ($k = 1, 2, \ldots, K$) are concatenated along the channel dimension to obtain the concatenated attention feature map $\hat{\mathbf{A}}_{i} \in \mathbb{R}^{KC \times H \times W}$:
$$\hat{\mathbf{A}}_{i} = \mathrm{concat}\left(\mathbf{A}_{i}^{1}, \mathbf{A}_{i}^{2}, \ldots, \mathbf{A}_{i}^{K}\right),$$
where $\mathrm{concat}(\cdot)$ represents the concatenation operation.
Then, $C$ 1 × 1 convolutions are used to perform convolution operations on $\hat{\mathbf{A}}_{i}$, and global average pooling is performed to obtain:
$$\mathbf{h}_{i} = f_{\mathrm{GAP}}\left(f_{1\times1}^{C}(\hat{\mathbf{A}}_{i})\right) \in \mathbb{R}^{C},$$
where $f_{1\times1}^{C}(\cdot)$ denotes the $C$ 1 × 1 convolutions.
Finally, $\mathbf{h}_{i}$ is multiplied by a learnable linear transform factor $\mathbf{W}_{l} \in \mathbb{R}^{M \times C}$ to obtain the local visual embedding, which can be expressed as follows:
$$\mathbf{e}_{i} = \mathbf{W}_{l}\mathbf{h}_{i}.$$
Similarly, the local visual embeddings are supervised using a loss function based on cross-entropy and guided by semantic knowledge. This function is defined as follows:
$$\mathcal{L}_{\mathrm{LVFE}} = -\frac{1}{N_{b}}\sum_{i=1}^{N_{b}} \log \frac{\exp\left(\sigma \cos\left(\mathbf{e}_{i}, \mathbf{a}^{y_{i}}\right)\right)}{\sum_{c=1}^{C^{s}} \exp\left(\sigma \cos\left(\mathbf{e}_{i}, \mathbf{a}^{c}\right)\right)}.$$
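Continuing the sketch of the local branch, the encoder below concatenates the $K$ attention feature maps along the channel dimension, reduces them back to $C$ channels with 1 × 1 convolutions, applies global average pooling, and maps the result into the semantic space with a learnable linear transform; module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualSemanticEncoder(nn.Module):
    """Sketch of the encoder: concatenate the K attention feature maps along the
    channel axis, reduce back to C channels with 1x1 convolutions, pool, and apply
    a learnable linear transform to obtain the local visual embedding."""
    def __init__(self, in_channels: int = 64, num_masks: int = 9, semantic_dim: int = 9):
        super().__init__()
        self.reduce = nn.Conv2d(num_masks * in_channels, in_channels, kernel_size=1)
        self.transform = nn.Linear(in_channels, semantic_dim, bias=False)  # learnable factor W_l

    def forward(self, attended: torch.Tensor) -> torch.Tensor:
        b, k, c, h, w = attended.shape
        concatenated = attended.reshape(b, k * c, h, w)       # concatenation along channels
        pooled = self.reduce(concatenated).mean(dim=(2, 3))   # 1x1 convolutions + global average pooling
        return self.transform(pooled)                         # local visual embedding, (batch, M)

encoder = VisualSemanticEncoder()
local_emb = encoder(torch.randn(4, 9, 64, 32, 64))            # shape (4, 9)
# The same scaled-cosine cross-entropy used in the global branch can supervise local_emb.
```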
3.3.4. Joint Optimization of the GLVFENet and Prediction
The full model is jointly trained in an end-to-end manner. Therefore, the shared feature extraction module, GVFE-Subnet, and LVFE-Subnet are simultaneously optimized using the following objective function:
$$\mathcal{L} = \mathcal{L}_{\mathrm{GVFE}} + \lambda \mathcal{L}_{\mathrm{LVFE}},$$
where $\lambda$ is a hyperparameter that balances the two terms. The optimal parameters of the network are obtained by minimizing this loss function.
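For illustration, a minimal sketch of the joint optimization step is given below; placeholder tensors stand in for the branch losses computed as in the earlier sketches, and the weighting value is arbitrary.

```python
import torch

# Placeholder tensors stand in for the GVFE and LVFE cross-entropy losses.
loss_gvfe = torch.tensor(1.2, requires_grad=True)
loss_lvfe = torch.tensor(0.8, requires_grad=True)
lambda_weight = 0.5   # hyperparameter balancing the local branch; value is illustrative

# A single backward pass propagates gradients through the shared backbone and both
# subnets, so one optimizer step updates the whole network end-to-end.
total_loss = loss_gvfe + lambda_weight * loss_lvfe
total_loss.backward()
```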
After the model is trained, we fuse the global and local embeddings and use the fused result to predict the labels of the test samples. First, we obtain the embeddings of a test sample in the semantic spaces of the GVFE-Subnet and LVFE-Subnet, denoted as $\mathbf{v}_{i}^{g} = \mathbf{W}_{g}\mathbf{g}_{i}$ and $\mathbf{v}_{i}^{l} = \mathbf{e}_{i}$, respectively. Then, we use a pair of fusion coefficients ($\alpha$ and $\beta$) to combine these embeddings. Finally, the label of the input sample is predicted by matching the fused embedding with the semantic vectors of each class and applying a calibrated stacking method, as shown in the following equation:
$$\hat{y}_{i} = \arg\max_{c \in \mathcal{Y}^{s} \cup \mathcal{Y}^{u}} \left[\cos\left(\alpha \mathbf{v}_{i}^{g} + \beta \mathbf{v}_{i}^{l}, \mathbf{a}^{c}\right) - \gamma \mathbb{I}\left(c \in \mathcal{Y}^{s}\right)\right],$$
where $\mathbb{I}(\cdot)$ is the indicator function (i.e., $\mathbb{I}(c \in \mathcal{Y}^{s}) = 1$ if $c$ is a seen class and 0 otherwise), and $\gamma$ is a calibration factor adjusted on the validation set.
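A sketch of this prediction step with calibrated stacking is given below; the fusion coefficients, the calibration factor, the split of six seen and three unseen classes, and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def predict_gzsl(global_emb, local_emb, class_semantics, seen_mask,
                 alpha=0.6, beta=0.4, gamma=0.3):
    """Sketch of label prediction with calibrated stacking.
    global_emb, local_emb : (batch, M) embeddings from the two branches
    class_semantics       : (num_classes, M) attribute vectors of all classes
    seen_mask             : (num_classes,) 1 for seen classes, 0 for unseen
    alpha, beta           : fusion coefficients (illustrative values)
    gamma                 : calibration factor tuned on a validation set
    """
    fused = alpha * global_emb + beta * local_emb
    scores = F.cosine_similarity(fused.unsqueeze(1), class_semantics.unsqueeze(0), dim=-1)
    scores = scores - gamma * seen_mask          # penalize seen classes to reduce bias
    return scores.argmax(dim=1)                  # predicted label over seen + unseen classes

# Example with six seen and three unseen classes described by nine binary attributes.
class_semantics = torch.rand(9, 9)
seen_mask = torch.tensor([1., 1., 1., 1., 1., 1., 0., 0., 0.])
labels = predict_gzsl(torch.randn(4, 9), torch.randn(4, 9), class_semantics, seen_mask)
```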