1. Introduction
In one definition, docking occurs “when one incoming spacecraft rendezvous with another spacecraft and flies a controlled collision trajectory in such a manner to align and mesh the interface mechanisms”; similarly, [1] defined docking as an on-orbit service that connects two free-flying man-made space objects. Such a service must be supported by an accurate, reliable, and robust position and orientation (pose) estimation system, making pose estimation an essential process in on-orbit spacecraft docking operations. Position can be obtained through the most widely used cooperative measurement, the Global Positioning System (GPS), while spacecraft attitude can be measured by an onboard Inertial Measurement Unit (IMU). However, these methods are not applicable to non-cooperative targets. Many studies and missions have focused on mutually cooperative satellites, but the demand for servicing non-cooperative satellites may increase in the future. Determining the attitude of a non-cooperative spacecraft is therefore a challenging research problem whose solution can improve spacecraft docking operations [2]. One traditional method, based on spacecraft control principles, is to estimate the position and attitude of a spacecraft from the equations of motion, which are functions of time. However, prediction from the equations of motion must be supported by sensor fusion to achieve the highest accuracy of the state estimation algorithm. For non-cooperative spacecraft, vision-based pose estimators are currently being developed for space applications, enabled by faster and more powerful computational resources [3].
Given this demand, computer vision is being developed as an alternative means of estimating the pose of a spacecraft. A vision-based detection system is a non-cooperative method that captures images of a target object with a camera and then processes them with estimation software. The estimator extracts numerical data from the images based on a constructed relation. When a mathematical model is unavailable, a deep learning algorithm can construct an empirical model by learning from data samples; the resulting model represents the relation between the input image data and the numerical output data. A vision-based estimator therefore needs input and output data samples rather than an exact relationship among the training parameters. The primary precondition of deep learning algorithms is that they require massive amounts of training data, and the cost of acquiring real spacecraft image data is exceptionally high; labeling the position and attitude while real photos are being taken is likewise problematic. However, pretrained convolutional neural network models are available that require less data for fine-tuning. Many researchers prefer public datasets over generating data themselves because public data have been well validated and are ready to use. Thus, using public data to construct the estimation algorithm is an excellent choice.
2. Related Works
Currently, deep learning algorithms are widely applied to aerospace information engineering problems. Applications of deep Convolutional Neural Network (CNN) architectures have been demonstrated in many studies, for example, processing satellite images to detect forest-fire hazard areas [4], estimating and forecasting air travel demand [5], determining the crack length in aerospace-grade aluminum samples [6], and aircraft maintenance and health management [7]. However, their application to pose estimation remains limited compared with aerospace information applications. In this study, we apply deep learning to solve the problems involved in spacecraft pose estimation. Several pose estimation methods have been demonstrated in various fields in prior studies.
The pose estimation of spacecraft has been a problem of considerable interest in various applications. In satellite image registration via push-broom sensors, registration shifts vary as the attitude of the satellite changes. Bamber et al. [8] constructed an attitude determination model for a low-orbit satellite by modeling the relationship between attitude changes and the rates of image registration shift. Before deep learning became well known, computer vision techniques based on artificial intelligence were already being applied to spacecraft pose estimation. Casonato and Palmerini [3] demonstrated low-level processing to detect the edges of an Automated Transfer Vehicle (ATV). After edge detection, a Hough transformation was employed to identify the basic shape of the vehicle, and the relative position and attitude parameters were determined from a mathematical formulation of the detected features. The relative position and attitude data were treated as real-time navigation data and combined with the Clohessy–Wiltshire relative motion equations to estimate the rendezvous trajectory of the ATV to the International Space Station.
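As an illustration of this classical low-level pipeline, the following sketch applies Canny edge detection followed by a standard Hough transform using OpenCV. The input file name and all threshold values are illustrative assumptions rather than parameters from the cited work.

```python
import cv2
import numpy as np

# Hypothetical input frame of the target vehicle.
image = cv2.imread("atv_frame.png", cv2.IMREAD_GRAYSCALE)

# Detect edges; the two thresholds control the hysteresis step.
edges = cv2.Canny(image, threshold1=50, threshold2=150)

# Standard Hough transform: returns (rho, theta) pairs for detected lines.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)

if lines is not None:
    for rho, theta in lines[:, 0]:
        print(f"line: rho={rho:.1f}px, theta={np.degrees(theta):.1f}deg")
```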
Beyond deep learning algorithms, several methods did not attempt to process the entire image; instead, they used the image only for feature detection. Liu et al. [9] applied an edge detection algorithm to extract meaningful features from images of a cylinder-shaped spacecraft. Ellipses obtained from arc detection were used to estimate the spacecraft pose by manipulating the shape, size, and position of the features. Using a similar approach, Aumann [10] developed a pose estimation algorithm with Open source Computer Vision (OpenCV) to detect two longitudinal lines on the sides of a cylinder-shaped object, then manipulated the positions, directions, and parallelism of the two lines to recover the pose of the cylindrical object. Sharma et al. [11] employed a Gaussian filter to detect the spacecraft’s edge lines and their intersection points in 2D images. Then, using principles of spacecraft kinematics, they manipulated the detected points and lines via the Efficient Perspective-n-Point (EPnP) method to solve for the 3D pose parameters from 2D images. Kelsey et al. [12] developed the Vision System for Autonomous Rendezvous and Docking (VISARD) algorithm by implementing a model-based technique with edge detection for image preprocessing. For pose refinement, they employed Iteratively Reweighted Least Squares (IRLS) to estimate the motion of the model, applied a tracking algorithm, and used an Extended Kalman Filter (EKF) to predict the model pose. Nevertheless, all of these prior studies have implementation limitations. For example, the edge lines of an object with a complicated shape lead to a complex mathematical formulation, and because numerous points and lines are detected, feature detection performance may degrade under harsh lighting conditions. Transfer learning is a technique for training a machine learning model using a learning agent that contains knowledge of a related task; this accumulated knowledge can, in theory, accelerate learning on a similar task [13]. Therefore, to reduce implementation complexity, transfer learning from a pretrained model is preferable for constructing the pose estimation algorithm.
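For contrast with the learning-based approach, below is a hedged sketch of feature-based pose recovery with the EPnP solver in OpenCV. The 3D model points, 2D detections, and camera intrinsics are placeholder values, not data from the cited studies.

```python
import cv2
import numpy as np

# Known 3D coordinates of features on the spacecraft model (meters).
object_points = np.array(
    [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0.5, 0.5, 1.0]],
    dtype=np.float64,
)
# Matching 2D detections in the image (pixels), e.g., from edge/corner detection.
image_points = np.array(
    [[320, 240], [420, 238], [424, 330], [318, 334], [372, 200]],
    dtype=np.float64,
)
# Assumed pinhole camera intrinsics (fx, fy, cx, cy).
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)

ok, rvec, tvec = cv2.solvePnP(
    object_points, image_points, K, None, flags=cv2.SOLVEPNP_EPNP
)
if ok:
    print("relative position:", tvec.ravel())
    print("rotation vector:", rvec.ravel())
```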
Image regression through deep learning has been widely applied to pose estimation model construction, and the basic algorithms and mathematical models have been developed in several works. The regression method demonstrated in [14] derived equations for constructing convolutional neural network models. That study used different orientation estimation models for rotations in different dimensions: the estimation algorithms for viewpoint estimation, surface-normal estimation, and 3D rotation involve different rotation parameters and operations. A spacecraft generally behaves as a freely rotating 3D object, so determining a spacecraft pose goes beyond the Euler angles used in surface-normal estimation; instead, quaternions are required to represent the object’s rotation.
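To illustrate the representation issue, the short example below converts an Euler-angle attitude to a unit quaternion with SciPy; the rotation sequence and angle values are arbitrary examples.

```python
from scipy.spatial.transform import Rotation

# Attitude given as intrinsic z-y-x Euler angles (degrees).
rot = Rotation.from_euler("zyx", [30, 45, 60], degrees=True)

# Unit quaternion in (x, y, z, w) order -- a singularity-free representation.
q = rot.as_quat()
print("quaternion:", q)

# Round-trip back to Euler angles to confirm the same rotation.
print("euler (deg):", rot.as_euler("zyx", degrees=True))
```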
Many public datasets contain images labeled with position and orientation data, with orientation represented as quaternions. Proença and Gao [15] generated such a dataset using Unreal Rendered Spacecraft On-Orbit (URSO), a simulator built in Unreal Engine 4 that creates realistic images of spacecraft surroundings by mimicking the appearance of outer space. The generated images visualize the harsh lighting conditions a spacecraft experiences and use realistic Earth-surface images as the background. The authors also demonstrated a method using a ResNet architecture built on a pretrained CNN model as a backbone. This method achieved high accuracy but also has high complexity; consequently, it consumes large amounts of computational resources.
Another previous work on spacecraft datasets can be found in [16], which introduced the Spacecraft Pose Network (SPN), a custom CNN whose architecture includes three separate branches. The authors trained the model on a public dataset, the Spacecraft PosE Estimation Dataset (SPEED), and estimated the six degree-of-freedom parameters separately. The position was estimated from a 2D bounding box around the target detected by one branch of the CNN, using the Gauss–Newton algorithm. The relative attitude was determined directly from the other two branches using a hybrid discrete-continuous method. Although custom convolutional networks are beyond the scope of this research, this work provides a significant contribution by performing estimation with separate classification and regression of the pose parameters.
Kendall et al. [17] presented a deep neural model that employs a convolutional neural network for camera pose estimation. Their dataset preparation treats the pose as parameters relative to the scene, and a practical pose estimation algorithm is developed on that basis. The researchers implemented this approach using a modified GoogLeNet architecture, a CNN model developed by Google. The pose estimation model was initially trained with indoor data and subsequently required less outdoor data for training; it successfully performed relocalization and pose prediction from camera images. Although spacecraft attitude estimation involves the spacecraft’s position and orientation relative to the camera rather than the pose of the camera itself, the principle still applies: artificial intelligence (AI) methods are concerned with constructing correlations between input and output data.
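As a minimal sketch of this kind of architecture modification, assuming PyTorch and torchvision, a pretrained GoogLeNet backbone can have its classification head replaced by a seven-dimensional pose regressor (three position components plus four quaternion components). This illustrates the general idea, not the exact network used in [17] or in this study.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained GoogLeNet backbone.
backbone = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)

# Replace the 1000-class classifier with a 7-dimensional pose head:
# 3 position components followed by 4 quaternion components.
backbone.fc = nn.Linear(backbone.fc.in_features, 7)
backbone.eval()  # eval mode so only the main output is returned

images = torch.randn(2, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    out = backbone(images)
position = out[:, :3]
quaternion = nn.functional.normalize(out[:, 3:], dim=1)  # force unit norm
print(position.shape, quaternion.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```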
Mahendran et al. [18] developed a pose estimation algorithm for single objects using a pretrained VGG-M model; training on the Pascal 3D+ dataset adopted the geodesic distance as the loss function. The next year, [19] used a ResNet-50 model as the base architecture and demonstrated various loss functions, such as simple/naïve, log-Euclidean, geodesic, and probabilistic losses, on the same dataset (Pascal 3D+). Another work involving single-object detection [20] applied state-of-the-art AI methods to medical science, implementing CNNs to estimate six degrees of freedom, including the position and attitude of the human brain, from MRI scans. That pose estimation model was constructed using a ResNet18 model, which reduced the required size of the training dataset; during training, the position loss was the mean squared error, while the orientation loss was the geodesic distance. In addition, some works have addressed multiple-object pose estimation [21,22,23,24]. Their contributions could be applied to multiple-object detection in space, for example, in situations where several objects interact and the system must estimate the pose of each object individually, such as space debris collection or vision-based docking operations with multiple detected objects. Although this research concerns single-spacecraft detection, multiple-object detection could be applied in future work on advanced aerospace image sensing; given the lack of data samples for multiple space objects, pose estimation for a single object is currently more applicable. In prior works, different applications have been implemented with different techniques, and the efficiencies of pretrained models have been evaluated by many research works. Based on that information, the base pretrained model was selected for high efficiency and minimal computational resource consumption.
Another consideration of this research is the formulation of the loss function. Various works have used different formulations for the position loss and orientation loss terms. For the position loss, most works implemented the mean squared error [20,24] as the loss function, although some research succeeded using the Euclidean distance [15,17] multiplied by a scaling coefficient, as shown in Equation (1):
$$L_{pos} = \beta \, \lVert \tilde{\mathbf{p}} - \mathbf{p} \rVert_2 \qquad (1)$$

where $\tilde{\mathbf{p}}$ is the trial position vector from the layers of the CNN model, $\mathbf{p}$ is the ground-truth position vector available in the dataset, and $\beta$ is a scaling factor. In many studies [14,18,20], the orientation loss was formulated as the geodesic loss in Equation (2):

$$L_{ori} = \gamma \, \arccos\left(\left|\tilde{\mathbf{q}} \cdot \mathbf{q}\right|\right) \qquad (2)$$

where $\tilde{\mathbf{q}}$ is the trial quaternion extracted by the layers of the CNN model, $\mathbf{q}$ is the ground-truth quaternion available in the dataset, and $\gamma$ is a scaling factor. The total loss defined in Equation (3) is the summation of Equations (1) and (2):

$$L = L_{pos} + L_{ori} \qquad (3)$$

To minimize the prediction error, the scaling factors $\beta$ and $\gamma$ must be optimized. Using the most straightforward method, $\beta$ and $\gamma$ can be fine-tuned by trial and error.
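A possible implementation of Equations (1)–(3), assuming PyTorch, is sketched below; the values of $\beta$ and $\gamma$ are placeholders to be tuned by trial and error as described above.

```python
import torch

def pose_loss(p_pred, p_true, q_pred, q_true, beta=1.0, gamma=1.0):
    """Total pose loss: scaled Euclidean position term plus geodesic
    orientation term, following Equations (1)-(3)."""
    # Equation (1): scaled Euclidean distance between position vectors.
    pos_loss = beta * torch.norm(p_pred - p_true, dim=-1)

    # Equation (2): geodesic distance between unit quaternions. The dot
    # product is clamped so acos stays numerically valid.
    q_pred = q_pred / torch.norm(q_pred, dim=-1, keepdim=True)
    q_true = q_true / torch.norm(q_true, dim=-1, keepdim=True)
    dot = torch.sum(q_pred * q_true, dim=-1).abs().clamp(max=1.0)
    ori_loss = gamma * torch.acos(dot)

    # Equation (3): summation of the two terms, averaged over the batch.
    return (pos_loss + ori_loss).mean()

# Example call with a dummy batch of 8 samples.
loss = pose_loss(torch.randn(8, 3), torch.randn(8, 3),
                 torch.randn(8, 4), torch.randn(8, 4),
                 beta=1.0, gamma=10.0)
print(loss.item())
```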
5. Conclusions
In a spacecraft docking operation, the position and attitude of the target spacecraft must be determined, so a sensor must exist that can obtain those operational parameters. Vision-based algorithms are currently being developed alongside image processing based on deep learning. The goal of this study was to construct a position and attitude estimation model using a deep neural network. The vision-based detection technique has the advantage of being applicable to both cooperative and non-cooperative space objects. In our implementation, the pose estimation model was constructed from a state-of-the-art CNN model, a modified version of GoogLeNet, forming a general pose estimation model. The model was trained on a simulated public dataset of the Soyuz spacecraft and then fine-tuned by repeated training with different mathematical expressions to achieve maximum accuracy. The exponential-based model yielded high position estimation accuracy but poor orientation estimation accuracy; the pose estimation model was therefore rebuilt with a different loss function and additional training iterations. With support from the National Astronomical Research Institute of Thailand (NARIT), we were able to overcome the computational resource limitations. The final weighted Euclidean pose estimation model achieves moderately high prediction accuracy.
Under the harsh lighting conditions of outer space, the target spacecraft may not be completely visible in images. The vision-based model must therefore detect the target spacecraft while remaining insensitive to directional light reflections and the planet surface in the background. A model’s performance depends strongly on its architecture and training procedure. Highly accurate results are usually obtained from pose estimation models based on complex pretrained models; however, this study showed that a convolutional neural model with low complexity can achieve moderately high efficiency when estimating spacecraft position and attitude. Nevertheless, although the complete model in this research performed efficiently, a real-world spacecraft docking operation demands greater position and attitude accuracy and reliability from an estimation system.
Future research should target higher prediction accuracy. Such a model could be constructed using a high-performance pretrained model such as VGG, Inception, DenseNet, or ResNet, whose architectures include deeper layers of neurons. When computational resources are unlimited, a position and attitude estimation model can be constructed by repurposing a high-complexity pretrained model. Currently, cluster computing and cloud computing are excellent options for reducing the computation time during model construction, but access to compute nodes may be costly.
Auxiliary algorithms are also an excellent option for reducing the prediction error of a pose estimation model. Many studies have manipulated detected points and lines of interest to extract position and attitude parameters from input images [3,9,10,11,12], and these contributions suggest ideas for future work. Point and line detection can be performed at lower complexity than vision-based CNN algorithms using tools such as OpenCV; feature detection can then be conducted with an auxiliary deep learning model. The outputs of those algorithms are numerical data that can be combined with image data into a mixed form of input.
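One hedged sketch of such a mixed-input design, assuming PyTorch, concatenates features from a small CNN with auxiliary numerical features (e.g., detected line and point parameters) before a pose regression head; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixedInputPoseNet(nn.Module):
    def __init__(self, num_aux_features=8):
        super().__init__()
        # Small image branch producing a 32-dimensional embedding.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regression head over concatenated image + numerical features.
        self.head = nn.Sequential(
            nn.Linear(32 + num_aux_features, 64), nn.ReLU(),
            nn.Linear(64, 7),  # 3 position + 4 quaternion outputs
        )

    def forward(self, image, aux):
        features = torch.cat([self.cnn(image), aux], dim=1)
        return self.head(features)

model = MixedInputPoseNet()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 7])
```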
In an actual spacecraft docking operation, the spacecraft is in dynamic motion rather than static as in a single 2D image, and the operation involves both detecting and tracking the spacecraft. A vision-based position and attitude estimation model can therefore be incorporated into a state estimation algorithm or other available techniques [12]. The principle of state estimation has been widely applied in spacecraft dynamics and control; for example, the Kalman filter is an elementary state estimation algorithm that combines state prediction from a physics-based model with measurements during the update stage. A vision-based estimation system could serve as the measurement model for spacecraft tracking. Hypothetically, the error of the position and attitude estimation model could then be corrected by the physics-based model during an actual docking operation with the spacecraft in motion.
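As a minimal illustration of this measurement-update idea, the following linear Kalman filter update in NumPy treats the vision-based estimator’s position output as the measurement. The state layout and noise values are assumptions, and a real docking filter would use the full spacecraft dynamics with an EKF or similar.

```python
import numpy as np

def kf_update(x, P, z, H, R):
    """Correct the predicted state x with vision measurement z."""
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Example: 6-state (position + velocity), vision measures position only.
x = np.zeros(6)                  # predicted state from the dynamics model
P = np.eye(6)                    # predicted covariance
H = np.hstack([np.eye(3), np.zeros((3, 3))])
R = 0.05 * np.eye(3)             # assumed vision measurement noise
z = np.array([10.2, -0.4, 3.1])  # position from the CNN estimator

x, P = kf_update(x, P, z, H, R)
print("corrected position:", x[:3])
```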
In summary, to satisfy a real docking operation, many auxiliary algorithms are recommended for future research to increase the performance of the vision-based position and attitude estimation model. A particular characteristic of the vision-based CNN model is that it is highly specific to the environment of its dataset: a model trained on simulation data will perform satisfactory estimation on that synthetic dataset. To address a real docking operation, however, the constructed model, with its knowledge of spacecraft pose estimation, could hypothetically be trained further with data from actual operations [27]. With this feasibility, the vision-based CNN pose estimation model could be trained with real photos to become practical and reliable for actual spacecraft docking in the near future. Moreover, an advanced state estimation algorithm combined with vision-based detection could be a critical factor in achieving higher efficiency in spacecraft motion prediction for actual space interactions.