1. Introduction
Aerospace technology is one of the most advanced fields of science and technology and an essential indicator of a country's scientific level and comprehensive national strength [1]. The identification and pose estimation of non-cooperative targets plays a vital role in the aerospace field, and accurately identifying and estimating the pose of such targets has become a pressing task [2]. Although non-cooperative target pose estimation has broad application prospects in on-orbit servicing and space debris removal, the lack of information exchange with the target still poses substantial challenges [2,3]. Pose estimation of non-cooperative targets based on monocular vision has attracted extensive attention from researchers because of its low power consumption, low mass, and small size, and has therefore been studied extensively [4]. Compared with binocular schemes, monocular vision also offers advantages in field of view and reliability. We therefore focus on non-cooperative target recognition and pose estimation with monocular vision.
At present, pose estimation methods for non-cooperative targets fall into two categories: traditional methods that rely on hand-designed geometric features, and end-to-end methods based on deep learning [5,6]. Traditional methods build on image processing algorithms and a priori knowledge of the pose [7]. Their key step is to extract feature points and compute hand-crafted descriptors such as Scale Invariant Feature Transform (SIFT) [8], Maximally Stable Extremal Regions (MSER) [9], Speeded Up Robust Features (SURF) [10], and Binary Robust Independent Elementary Features (BRIEF) [11]. Using the Canny operator and the Hough transform to extract edge and line features from grayscale images of non-cooperative targets, D’Amico et al. [12] combined known 3D models to relate the pose parameters to the image pixels and thereby solved for the pose. Based on monocular camera images, Sharma et al. [13] introduced a hybrid image processing method for attitude initialization that combined weak gradient elimination with the Sobel operator and the Hough transform to detect the geometric features of the target and recover its attitude; however, this method struggles with complex background images.
Deep learning methods are divided into indirect methods and direct methods. The indirect method uses deep learning in place of hand-designed point, line, and other target features, and then solves the pose from the resulting 2D–3D correspondences. The Render CNN classification model predicts the viewpoint: a convolutional neural network is trained to perform fine-grained classification of three angles (pitch, yaw, and roll) for pose estimation [14]. The PoseCNN model has two branches: one localizes the object center and predicts its distance from the camera to estimate the translation vector, while the other regresses a quaternion to estimate the object's 3D rotation [15]. Sharma et al. [6] used threshold segmentation to remove background pixels, but this approach generates false edges and cannot eliminate background interference completely. To address this, Chen et al. [16] first detected the target and cropped it from the background, then used the High-Resolution Network (HRNet) to predict the target's 2D critical points, built the target's 3D model by multi-view triangulation, and finally solved the 2D–3D correspondences with PnP to obtain the target's pose. Nan et al. [17] replaced the feature pyramid network with the high-resolution, multi-scale prediction network HRNet, mitigating the information loss caused by reduced resolution and enhancing the detection of space objects at different scales. In addition, applying the Transformer model to rigid-target pose estimation, Wang et al. [18] proposed a representation based on sets of critical points and designed an end-to-end critical point regression network to strengthen the relationships between critical points.
The direct method obtains the pose information directly, without solving 2D–3D correspondences. The Spacecraft Pose Network proposed by Sharma et al. [19] has three branches: the first performs target detection to determine the bounding box of the target, while the other two use the bounding box region to estimate the relative attitude and then apply relative-attitude constraints to compute the relative position. UrsoNet, proposed by Proença et al. [20], uses ResNet as the backbone network, treats position estimation as a simple regression branch and attitude estimation as a classification branch, and solves the attitude with probabilistic soft classification and a Gaussian mixture model.
To sum up, for the pose estimation of non-cooperative targets, hand-designed feature methods have significant limitations and insufficient robustness; the indirect deep learning method is accurate but computationally complicated, while the direct method is simpler but less accurate than the indirect method. In rendezvous and approach scenarios, the onboard resources of the spacecraft are limited, which imposes stringent requirements on computation speed and pose estimation accuracy. We therefore choose the direct method, which has advantages in memory usage and computing power consumption, and further improve and optimize it to achieve higher pose estimation accuracy.
The remainder of the paper is organized as follows: Section 2 introduces the theoretical basis of the target detection and pose estimation network structures; Section 3 reports the experiments, tests, and verification procedures and compares the proposed network with other networks to demonstrate its advantages; Section 4 summarizes the paper and outlines future work.
2. Proposed Method
The non-cooperative target pose estimation method proposed in this paper is shown in Figure 1. It consists of two stages: first, the target detection network predicts the bounding box of the target and the target is cropped out to remove background interference such as deep space and the Earth, so that the next stage can focus on pose estimation; then, the cropped target image is fed into the pose estimation network, which predicts the position and attitude of the target with the direct method.
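The two-stage flow can be summarized by the minimal sketch below; `detector` and `pose_net` are hypothetical stand-ins for the YOLOv5s and scSE-LHRNet networks described in the following subsections, and the crop is taken directly from the predicted box.

```python
# Minimal sketch of the two-stage pipeline (illustrative function names).
def estimate(image, detector, pose_net):
    # Stage 1: predict the bounding box and crop away deep-space/Earth background
    x1, y1, x2, y2 = detector(image)
    target_crop = image[y1:y2, x1:x2]
    # Stage 2: the direct method predicts position and attitude from the crop
    position, attitude = pose_net(target_crop)
    return position, attitude
```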
Among target detection algorithms, two-stage detectors such as Faster R-CNN [21] achieve high detection accuracy, while one-stage detectors such as YOLOv3-tiny [22], YOLOv4 [23], YOLOv5 [24], and YOLOv5s [25] are fast. YOLOv5s has a small model size, computes quickly, and its detection accuracy is no lower than that of Faster R-CNN. The High-Resolution Network (HRNet) [26] handles spatially sensitive tasks effectively, but its architecture is complex and occupies a large amount of memory.
In rendezvous and approach missions, the equipment carried by a spacecraft is expensive, memory and computing power are limited, and real-time performance is required. We therefore select YOLOv5s as the target detection network and an improved HRNet as the pose estimation network; the improved HRNet has a lightweight structure while achieving higher prediction accuracy.
2.1. Target Detection Network Based on YOLOv5s
The structure of YOLOv5s is shown in Figure 2. It consists of an input layer, a backbone layer, a neck layer, and an output layer. The network is small and fast at inference, making it suitable for rendezvous and approach tasks.
The input layer includes Mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling. It recognizes small objects well, which suits the case considered here, where the satellite appears as a small target in the far field of view.
The backbone layer includes CSP1_X and Focus modules. The Focus module is shown in Figure 3; it applies the slicing operation shown in Figure 4 to turn a 4 × 4 × 3 feature map into a 2 × 2 × 12 feature map. The CSP structure is shown in Figure 5 and alleviates the vanishing-gradient problem in the network. By using CSP, YOLOv5s achieves good accuracy and computation speed while reducing the model size.
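The slicing operation can be written compactly; the snippet below is a minimal PyTorch sketch of the Focus slice (not the official YOLOv5 implementation, and the ordering of the four pixel phases is an assumption).

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Focus slicing: (B, C, H, W) -> (B, 4C, H/2, W/2), as in Figure 4."""
    return torch.cat(
        [x[..., ::2, ::2],     # even rows, even columns
         x[..., 1::2, ::2],    # odd rows, even columns
         x[..., ::2, 1::2],    # even rows, odd columns
         x[..., 1::2, 1::2]],  # odd rows, odd columns
        dim=1,
    )

x = torch.randn(1, 3, 4, 4)   # a 4 x 4 x 3 feature map
print(focus_slice(x).shape)   # torch.Size([1, 12, 2, 2])
```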
The neck layer is shown in Figure 6. Its FPN, PAN, and CSP2_X modules enhance the network's feature fusion ability and improve prediction accuracy.
The weighted non-maximum suppression and the loss function used in the output layer address multi-target occlusion and bounding box mismatch, respectively. The loss function [27] is:

$$IoU = \frac{|A \cap B|}{|A \cup B|}, \qquad GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \qquad L_{GIoU} = 1 - GIoU$$

where IoU is the intersection-over-union ratio, which measures the overlap between the ground-truth box and the predicted box; GIoU is the generalized intersection-over-union, which measures the detection quality; A and B are the areas of the predicted box and the ground-truth box, respectively; and C is the area of the smallest enclosing box of A and B.
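For concreteness, here is a minimal sketch of the GIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format; the function name and box format are illustrative and not taken from the YOLOv5s code.

```python
def giou_loss(pred, gt):
    """GIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # Intersection of predicted box A and ground-truth box B
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area
    area_a = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_b = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```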
2.2. High-Resolution Network
The network structure of HRNet is shown in Figure 7. Its four parallel sub-networks perform multi-resolution feature fusion across four stages so that the feature maps always maintain high resolution.
The four parallel sub-networks of HRNet are shown in Figure 8. Layer 1 always maintains high resolution; each successive layer increases the feature depth but halves the resolution. In Figure 8, $N$ denotes the image resolution, $s$ the stage index, and $r$ the sub-network index. $N_{sr}$ denotes the resolution of the $r$-th sub-network in stage $s$, which is $1/2^{r-1}$ of the resolution of Layer 1.
Each sub-network repeatedly receives feature information from the other sub-networks through the exchange units marked by the blue boxes in Figure 7. The exchange unit of Stage 3 is shown in Figure 9, where $C_{sr}^{b}$ denotes the convolution unit of the $r$-th sub-network in the $b$-th exchange block of stage $s$, and $\varepsilon_{s}^{b}$ is the corresponding exchange unit. The exchange unit operates as:

$$Y_k = \sum_{i=1}^{s} a(X_i, k)$$

where $X_i$ is the input, $Y_k$ is the output, and $a(X_i, k)$ upsamples or downsamples the input $X_i$ from resolution $i$ to resolution $k$. The fusion process is shown in Figure 10: feature maps of different resolutions are upsampled or downsampled to fuse cross-scale information, so the feature maps always maintain high resolution and the high-resolution features are supplemented multiple times.
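The following is a minimal PyTorch sketch of this exchange step; it assumes all branches share the same channel count and uses bilinear interpolation for $a(X_i, k)$, whereas HRNet itself uses strided and 1 × 1 convolutions, so this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def exchange(features):
    """features: list of maps [(B, C, H, W), (B, C, H/2, W/2), ...]."""
    outputs = []
    for k, ref in enumerate(features):
        fused = ref.clone()  # a(X_k, k) is the identity
        for i, x in enumerate(features):
            if i == k:
                continue
            # a(X_i, k): resample branch i to branch k's resolution
            x = F.interpolate(x, size=ref.shape[-2:], mode="bilinear",
                              align_corners=False)
            fused = fused + x
        outputs.append(fused)
    return outputs
```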
The residual blocks used by HRNet are shown in Figure 11. In the Bottleneck block, the two 1 × 1 convolutions reduce and then restore the channel dimension, and the middle 3 × 3 convolution extracts features; in the Basicblock, the two 3 × 3 convolutions extract features. In both blocks, a shortcut connection adds the input to the convolution output. Residual blocks increase the depth of HRNet, allowing it to extract deeper information.
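As a reference, here is a minimal sketch of the standard Basicblock (before the lightweight modification in Section 2.3); the channel counts and normalization choices are illustrative.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # two 3 x 3 convolutions extract features
        self.conv1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the shortcut adds the input back onto the convolution output
        return self.relu(x + self.conv2(self.conv1(x)))
```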
2.3. Pose Estimation Network Based on scSE-LHRNet
Although HRNet’s parallel sub-networks and repeated feature fusion maintain high-resolution feature maps, it has the following problems: the complex network model leads to a large number of parameters, many repeated calculations, and heavy memory usage; and the direct fusion of multi-resolution features cannot effectively utilize the channel and spatial feature information of the feature maps.
To address these problems, this paper proposes a Lightweight HRNet (LHRNet) combined with the concurrent spatial and channel squeeze and excitation block (scSE) [28], and uses it for pose estimation of non-cooperative targets.
As shown in Figure 12, the backbone of scSE-LHRNet is HRNet, and the residual blocks are lightweight Bottleneck and Basicblock variants. In the multi-resolution feature fusion stage, we add the scSE hybrid attention module, and at the end of the network we connect two fully connected layers that perform position estimation and attitude estimation, respectively.
Standard convolution and Depthwise Separable Convolution (DSC) [29] are shown in Figure 13. If the input is $D_F \times D_F \times M$ and the output is $D_F \times D_F \times N$, then standard convolution uses $N$ kernels of size $D_K \times D_K \times M$, and its computation cost is $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$. DSC consists of a depthwise convolution with $M$ kernels of size $D_K \times D_K \times 1$ and a pointwise convolution with $N$ kernels of size $1 \times 1 \times M$, and its computation cost is $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$. The ratio of the two [29] is:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^{2}}$$
Using DSC for the convolution operations of Bottleneck and Basicblock therefore significantly reduces the number of network parameters and computations and lowers the complexity of the constructed multi-resolution sub-network structure [29], making HRNet more lightweight.
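A minimal PyTorch sketch of such a replacement is given below; the normalization and activation layers are assumptions, and the paper's exact lightweight block design may differ.

```python
import torch.nn as nn

def dsc(in_ch, out_ch, kernel_size=3, stride=1):
    """Depthwise separable convolution replacing a standard convolution."""
    return nn.Sequential(
        # depthwise: one D_K x D_K filter per input channel (groups = in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                  padding=kernel_size // 2, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # pointwise: N filters of size 1 x 1 x M mix the channels
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```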
The attention mechanism enables the network to focus on extracting vital feature information. The scSE is a hybrid attention mechanism that attends to both spatial and channel features. Its structure is shown in Figure 14.
The sSE branch, which attends to the spatial domain, first converts the input feature map into a spatial attention map through a convolutional dimensionality reduction and a Sigmoid activation, then recalibrates the features with this map to obtain a new, spatially recalibrated feature map. The cSE branch, which attends to the channel domain, first applies global pooling to the input feature map; the number of channels is halved by a convolution followed by ReLU, restored to the original number by a second convolution, and a Sigmoid produces the channel mask, which recalibrates the features to obtain a new, channel-recalibrated feature map. The outputs of the two branches are added to obtain the final feature map of the scSE module.
Adding the scSE block before the multi-resolution feature fusion of HRNet can extract more useful spatial feature information and channel feature information, making the fusion of different resolution features more efficient.
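The following minimal sketch follows the general form of the scSE block in [28]; the channel reduction ratio and the use of 1 × 1 convolutions here are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        # cSE: global pooling, channel squeeze-and-excitation, Sigmoid mask
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # sSE: 1 x 1 convolution to a single spatial map, Sigmoid mask
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, u):
        # recalibrate the channels and the spatial locations, then add
        return u * self.cse(u) + u * self.sse(u)
```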
The scoring criteria used in the Kelvins Pose Estimation Challenge (KPEC) [30] are as follows: the $i$-th image has a pose score $score^{(i)}$, a position score $score_{pos}^{(i)}$, and an attitude score $score_{att}^{(i)}$, and the total score on the dataset is $score$:

$$score_{pos}^{(i)} = \frac{\left\| \mathbf{t}^{(i)} - \hat{\mathbf{t}}^{(i)} \right\|_2}{\left\| \mathbf{t}^{(i)} \right\|_2}, \qquad score_{att}^{(i)} = 2 \arccos\left( \left| \left\langle \hat{\mathbf{q}}^{(i)}, \mathbf{q}^{(i)} \right\rangle \right| \right)$$

$$score^{(i)} = score_{pos}^{(i)} + score_{att}^{(i)}, \qquad score = \frac{1}{N} \sum_{i=1}^{N} score^{(i)}$$

where $\hat{\mathbf{q}}^{(i)}$ and $\mathbf{q}^{(i)}$ are the quaternions of the predicted and true attitudes, respectively, and $\hat{\mathbf{t}}^{(i)}$ and $\mathbf{t}^{(i)}$ are the translation vectors of the predicted and true positions, respectively; the smaller the score, the more accurate the pose estimation.
This paper uses the KPEC dataset, and pose estimation is performed with the direct method, which completes both position estimation and attitude estimation. To obtain accuracy on the same scale as the competition score, the network loss function is taken as a weighted sum of the position and attitude terms:

$$L = \lambda_{1} \, score_{pos} + \lambda_{2} \, score_{att}$$

where $\lambda_{1}$ and $\lambda_{2}$ are adjustable parameters. Experiments show that the pose estimation network is most accurate when $\lambda_{1} = 0.6$ and $\lambda_{2} = 0.4$; other values of $\lambda_{1}$ and $\lambda_{2}$ slightly reduce the prediction accuracy.
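A minimal sketch of such a combined loss is given below, assuming the weighted-sum form above and normalized quaternions; the tensor shapes, the clamp, and the batch reduction are assumptions for illustration.

```python
import torch

def pose_loss(t_pred, t_gt, q_pred, q_gt, lam1=0.6, lam2=0.4):
    """Weighted loss for (B, 3) translations and (B, 4) unit quaternions."""
    # Position term: relative translation error, as in the KPEC score
    pos = torch.norm(t_gt - t_pred, dim=-1) / torch.norm(t_gt, dim=-1)
    # Attitude term: angular distance between unit quaternions
    dot = torch.sum(q_pred * q_gt, dim=-1).abs().clamp(max=1.0)
    att = 2.0 * torch.acos(dot)
    return (lam1 * pos + lam2 * att).mean()
```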
4. Conclusions
This paper presented a deep-learning-based non-cooperative target pose estimation method. First, we introduced the overall framework of the pose estimation method; then, we adopted YOLOv5s, with its small model size and fast computation, as the target detection network for non-cooperative spacecraft; finally, we used DSC and scSE modules to improve HRNet's multi-resolution sub-networks and multi-resolution fusion, making the network model lightweight while further improving prediction accuracy. Experimental results on the SPEED dataset showed that, compared with several advanced methods, the proposed pose estimation method achieves a good balance among the number of parameters, the amount of computation, and accuracy, demonstrating its effectiveness and superiority.
Since the dataset we used consists of static images rather than real space scenes of an actually moving spacecraft, the scope of application is relatively narrow. In follow-up research, we will apply the proposed method to video datasets to improve the generality of pose estimation, and we will continue to explore new, efficient pose estimation methods to further improve performance.