1. Introduction
The satellite tracking, telemetry, and command (TT&C) system is used to transmit telemetry and telecommand data and to determine satellite orbits [1]. With the development and construction of prominent satellite constellations such as Hongyun and Hongyan in China [2], the number of satellites in space has increased rapidly [3,4], resulting in a growing TT&C demand for space networks. On the other hand, due to the long construction period, high cost, and constraints of national borders, the number of ground stations is relatively limited, resulting in a shortage of available TT&C resources. Therefore, more severe resource conflicts will be encountered in TT&C task planning, which brings greater challenges to the TT&C resource management of space networks.
TT&C tasks can be divided into two categories: common tasks and emergency tasks. Common tasks [5] request the routine tracking of satellites in orbit and form the majority of TT&C tasks. Emergency tasks [6] are real-time tasks that arise during emergencies. As satellites play an increasingly important role in disaster relief and emergency search and rescue, more and more emergency TT&C tasks are emerging. During the planning stage, an emergency task often needs to be inserted into the idle time of the TT&C equipment or to preempt planned resources that have already been allocated to common tasks. However, the existing traditional TT&C task-planning methods only aim to maximize the total revenue of the completed tasks. They generate many short time intervals (i.e., time fragments) between the scheduled TT&C windows that cannot be used to plan tasks, which significantly affects the planning of emergency tasks. When the resources of common TT&C tasks are preempted, those common tasks cannot be successfully executed. Therefore, it is necessary to study TT&C task-planning methods oriented toward low time fragmentation, which facilitate the planning of emergency TT&C tasks and reduce their response time.
The existing methods for common TT&C task planning mainly include deterministic algorithms [7,8,9], heuristic algorithms [10,11,12], and intelligent algorithms [13,14,15,16,17]. Among them, deterministic algorithms are mainly suitable for small-scale TT&C task scheduling problems and are difficult to apply to large-scale problems. Heuristic algorithms rely on heuristic rules to obtain a near-optimal solution within a certain time; however, these rules are often derived from engineering experience and carry great uncertainty. Intelligent algorithms seek optimal results while taking both efficiency and accuracy into account. With the gradual increase in the number of satellites in recent years, traditional deterministic and heuristic algorithms are no longer well suited to TT&C task-planning problems, so most current research focuses on intelligent algorithms. References [13,14,15] all established mathematical models of TT&C resource planning that minimize the workload and used different ant colony algorithms to solve them. Reference [16] designed an emergency task-planning algorithm based on simulated annealing with the goal of maximizing the total task revenue; the algorithm reduces the probability of falling into a local optimum through an iterative neighborhood search and an instability strategy to achieve a reasonable allocation of multi-dimensional TT&C resources. Reference [17] established a mathematical model of TT&C task planning with the overall task satisfaction rate as the optimization goal and proposed an improved DQN (Deep Q-Network)-based TT&C resource configuration method to solve the joint planning problem of heterogeneous TT&C network resources; this algorithm effectively improves the satisfaction rate of the TT&C tasks.
Although References [13,14,15,16,17] used intelligent algorithms to effectively improve the completion rate of TT&C tasks, they do not control the time fragments of the TT&C equipment during the task-planning process, which results in untimely responses to emergency tasks. Therefore, this paper designs a low time fragmentation-oriented TT&C task-planning method (LTFTT&C) based on the DDQN (Double Deep Q-Network) to minimize time fragmentation and maximize task revenue. The algorithm uses deep reinforcement learning to mine the distribution rules of the TT&C task resources of the space network and constructs feasible schemes for common TT&C tasks that account for both the task completion rate and the emergency task response speed. Unlike traditional TT&C task-planning algorithms, LTFTT&C can reduce the time fragments produced in the task-planning process while ensuring the success rate of the tasks. The basic idea is as follows. Firstly, aiming at the high time fragmentation and low utilization rate of TT&C equipment in existing task-planning results, a mathematical model of TT&C task planning is developed with the objective of minimizing the total time fragments and maximizing the total task revenue. Then, based on a conflict graph model, the task-planning problem is transformed into a maximum independent set problem. Finally, a DDQN-based task-planning algorithm and the mathematical model are used alternately to complete the entire solution process. Comparing the LTFTT&C algorithm with existing methods, the simulation results show that the proposed algorithm reduces the time fragments of the TT&C equipment while ensuring the revenue of common TT&C tasks, and it effectively improves the response speed to subsequent emergency tasks.
The remainder of this paper is organized as follows: Section 2 describes the TT&C task-planning scenario and problem. Section 3 establishes the mathematical model of the TT&C task-planning problem in a space network. Section 4 transforms the TT&C task-planning problem into an independent set problem on a graph. Section 5 introduces the low time fragmentation-oriented TT&C task-planning algorithm. Section 6 conducts simulation experiments to verify the effectiveness of the proposed algorithm. The last section gives the research conclusions.
2. Scenario and Problem Description
The TT&C scenario shown in Figure 1 is composed of satellites and ground stations. The set of satellites is denoted $S = \{s_1, s_2, \ldots, s_N\}$, where $s_i$ represents the $i$-th satellite. The set of ground stations is denoted $G = \{g_1, g_2, \ldots, g_H\}$, where $g_h$ represents the $h$-th ground station. Each ground station has $L$ TT&C equipment. The set of all the TT&C equipment is denoted $E = \{e_1, e_2, \ldots\}$, where $e_j$ represents the $j$-th TT&C equipment. There are $K_{i,j}$ visible windows between satellite $s_i$ and TT&C equipment $e_j$ in the planning horizon, and their set is $W_{i,j} = \{w_{i,j}^1, \ldots, w_{i,j}^{K_{i,j}}\}$, where $w_{i,j}^k$ represents the $k$-th visible window between satellite $s_i$ and TT&C equipment $e_j$ in the planning horizon. Each visible window $w_{i,j}^k$ can be expressed as $(ts_{i,j}^k, te_{i,j}^k, d_{i,j}^k)$, where $ts_{i,j}^k$ and $te_{i,j}^k$ are the start time and end time of the visible window, and $d_{i,j}^k$ is the sign of orbit ascending or descending. $W_i$ is the set of all the visible windows of satellite $s_i$ over all the TT&C equipment, and the set of all visible windows in the planning horizon is denoted $W$.
During the planning horizon, all $N$ satellites in the space network need to be tracked, telemetered, and commanded as required, so the space network has $N$ TT&C tasks, whose set is denoted $T = \{t_1, t_2, \ldots, t_N\}$, where $t_i$ represents the TT&C requirements of the $i$-th satellite. Each task $t_i$ is represented by a four-dimensional tuple $t_i = (id_i, req_i, type_i, rev_i)$, where $id_i$ represents the ID of the satellite to be tracked, telemetered, and commanded in task $t_i$; $req_i$ represents the requirement on the TT&C windows, i.e., the number of TT&C windows required and the requested orbit ascending or descending attribute for task $t_i$; $type_i$ indicates the type of TT&C equipment, that is, the function of the TT&C equipment required by task $t_i$; and $rev_i$ represents the revenue brought by completing task $t_i$.
As shown in Figure 2, $w_1$, $w_2$, and $w_3$ are three scheduled windows of different tasks performed by the same equipment. A relatively short time interval, i.e., a time fragment, is easily formed between adjacent scheduled time windows of the same TT&C equipment. Specifically, two time intervals with lengths $\Delta t_1$ and $\Delta t_2$ are formed between the three consecutive time windows $w_1$, $w_2$, and $w_3$ of the TT&C equipment $e_j$. The length of $\Delta t_1$ is shorter than the threshold $Th$, and the length of $\Delta t_2$ is longer than the threshold $Th$; therefore, $\Delta t_1$ forms a time fragment of equipment $e_j$. In the planning horizon, if the interval between two adjacent time windows executing tasks on the same TT&C equipment is less than the threshold $Th$, a time fragment $f$ is formed, and the set of time fragments is denoted $F$.
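As an illustration only (not the paper's implementation), the following sketch shows how the time fragments of one piece of TT&C equipment could be counted from its scheduled windows, assuming each window is stored as a (start, end) pair and Th is the fragment threshold:

```python
# Hypothetical sketch: counting time fragments on one piece of TT&C equipment.
# A scheduled window is a (start, end) pair in minutes; th is the fragment threshold Th.

def time_fragments(scheduled_windows, th=5.0):
    """Return the list of gap lengths shorter than th between adjacent windows."""
    windows = sorted(scheduled_windows)              # sort by start time
    fragments = []
    for (s1, e1), (s2, e2) in zip(windows, windows[1:]):
        gap = s2 - e1                                # idle interval between adjacent windows
        if 0 < gap < th:                             # a short, unusable interval -> time fragment
            fragments.append(gap)
    return fragments

# Example: gaps of 3 min and 12 min -> only the 3 min gap is a fragment (Th = 5 min).
print(time_fragments([(0, 10), (13, 20), (32, 40)]))   # [3]
```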
The essence of the TT&C task-planning problem of the space network is to allocate suitable visible windows of suitable equipment to each satellite that needs to be tracked, telemetered, and commanded. In this way, the TT&C equipment can track, telemeter, monitor, and control the satellite flight orbit, attitude, and the operational status of the onboard subsystems. The core problem studied in this paper is to maximize the total revenue of the TT&C tasks while minimizing the total time fragments after the execution of the TT&C tasks, without exceeding the capacity of the network resources.
3. Mathematical Model
During the execution of a TT&C task, the transceiver equipment between the satellite and the ground station can only perform one task at a time. Therefore, there may be conflicts in the simultaneous scheduling of different visible windows. Any two visible windows $w_{i,j}^k$ and $w_{m,n}^l$ conflict with each other if both of the following conditions are satisfied:
- (1) There is time overlap between the visible windows $w_{i,j}^k$ and $w_{m,n}^l$, i.e., $ts_{i,j}^k \le ts_{m,n}^l < te_{i,j}^k$ or $ts_{m,n}^l \le ts_{i,j}^k < te_{m,n}^l$.
- (2) Windows $w_{i,j}^k$ and $w_{m,n}^l$ occupy the same satellite or the same TT&C equipment, i.e., $i = m$ or $j = n$.
As shown in Figure 3, the start time of window $w_{m,n}^l$ is before the end time of window $w_{i,j}^k$, and the end time of window $w_{m,n}^l$ is after the start time of window $w_{i,j}^k$. The two windows therefore overlap and use the same equipment at the same time, which causes a conflict.
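For illustration, a minimal sketch of the two conflict conditions is given below; the window representation (satellite, equipment, start, end) is an assumption made for the example:

```python
# Hypothetical window representation: (satellite i, equipment j, start, end).

def windows_overlap(w1, w2):
    """Condition (1): the two windows overlap in time."""
    _, _, s1, e1 = w1
    _, _, s2, e2 = w2
    return s1 < e2 and s2 < e1

def windows_conflict(w1, w2):
    """Windows conflict if they overlap and share a satellite or equipment (condition (2))."""
    i1, j1, _, _ = w1
    i2, j2, _, _ = w2
    return windows_overlap(w1, w2) and (i1 == i2 or j1 == j2)

# Two overlapping windows on the same equipment -> conflict.
print(windows_conflict((1, 3, 0, 10), (2, 3, 8, 15)))   # True
```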
A TT&C task of a satellite requires multiple ascending or descending orbit TT&C windows, and a combination of visible windows that meets the requirements constitutes a feasible scheme for the TT&C task. In this paper, $c_i^{\alpha}$ represents the $\alpha$-th feasible scheme of task $t_i$, so the set of feasible schemes for task $t_i$ is $C_i$, and the set of feasible schemes for all TT&C tasks is $C$. For example, suppose the window requirement of task $t_i$ is two ascending orbit windows and two descending orbit windows. The $\alpha$-th feasible scheme of task $t_i$ is then a sequence of four windows, which are, respectively, the first ascending orbit window, the first descending orbit window, the second ascending orbit window, and the second descending orbit window in the $\alpha$-th scheme of the task.
Since different visible windows between satellites and TT&C equipment may conflict, there may also be conflicts between different feasible schemes. If two feasible schemes contain the same window, or if there is a conflict between their visible windows, then the two feasible schemes conflict. Therefore, any two feasible schemes $c_i^{\alpha}$ and $c_m^{\beta}$ conflict with each other if they meet one of the following two conditions:
(1) There is a visible window that belongs to both $c_i^{\alpha}$ and $c_m^{\beta}$.
(2) There is a window in $c_i^{\alpha}$ and a window in $c_m^{\beta}$ that conflict with each other.
$\Phi(c_i^{\alpha})$ is used to represent the set of all feasible schemes that conflict with $c_i^{\alpha}$. The Boolean variable $x_i^{\alpha}$ is used to indicate whether the $\alpha$-th feasible scheme of task $t_i$ is scheduled: $x_i^{\alpha} = 1$ means that the $\alpha$-th feasible scheme of task $t_i$ is scheduled, and $x_i^{\alpha} = 0$ means that it is not scheduled.
The goal of TT&C task planning is to allocate appropriate time windows for all TT&C tasks within the planning horizon so that the total revenue of TT&C tasks is maximized and the time fragments generated by them are minimized. The following conditions must be met when planning the TT&C tasks:
(1) A TT&C task can only be completed once by a TT&C feasible scheme.
(2) A TT&C scheme and its conflicting schemes cannot be scheduled at the same time.
The low time fragmentation-oriented TT&C task-planning problem is modeled as a multi-objective optimization problem P whose objectives are to maximize the total task revenue and to minimize the total length of the time fragments, subject to two constraints: C1 is the uniqueness constraint of the tasks, which guarantees that each TT&C task is completed at most once; C2 is the conflict constraint of the TT&C schemes, i.e., a TT&C scheme and its conflicting schemes cannot be scheduled simultaneously.
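The display equations of problem P do not survive in this text. A plausible reconstruction, consistent with the verbal description of the objectives and of constraints C1 and C2 and using the notation introduced above (the exact formulation is an assumption), is:

```latex
\mathbf{P}:\quad
\max_{x}\ \sum_{i=1}^{N}\sum_{\alpha} rev_i\, x_i^{\alpha},
\qquad
\min_{x}\ \sum_{f \in F(x)} |f|

\text{C1:}\ \sum_{\alpha} x_i^{\alpha} \le 1,\ \forall i;
\qquad
\text{C2:}\ x_i^{\alpha} + x_m^{\beta} \le 1,\ \forall\, c_m^{\beta} \in \Phi(c_i^{\alpha});
\qquad
x_i^{\alpha} \in \{0, 1\},
```

where $F(x)$ denotes the set of time fragments induced by the selected schemes and $|f|$ is the length of fragment $f$.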
4. Problem Transformation
Observing the above optimization problem P, we can see that it is a multi-objective combinatorial optimization problem, which has been proven to be NP-hard [18]. To solve the TT&C task-planning problem more conveniently and effectively, this paper first transforms the problem into a graph theory problem and then uses machine learning to solve it. Specifically, a task-planning conflict graph $CG$ with vertex set $V$ is constructed, in which the vertex $v_i^{\alpha} \in V$ represents the feasible scheme $c_i^{\alpha}$ of task $t_i$. The edges in $CG$ represent the conflict relationships between different task feasible schemes. Any two vertices $v_i^{\alpha}$ and $v_n^{\beta}$ conflict with each other if they meet one of the following two conditions:
- (1) $v_i^{\alpha}$ and $v_n^{\beta}$ represent two different feasible schemes of the same task, i.e., $i = n$.
- (2) There is a conflict between the feasible schemes corresponding to $v_i^{\alpha}$ and $v_n^{\beta}$, i.e., $c_n^{\beta} \in \Phi(c_i^{\alpha})$.
If either condition holds, there is a conflict relationship between the two vertices, and an edge connects them in $CG$.
The above two conditions correspond to the constraints C1 and C2 in problem P: C1 represents the conflict relationship between different feasible schemes of the same TT&C task, and C2 represents the conflict relationship between feasible schemes of different TT&C tasks. Therefore, solving the optimization problem P is equivalent to solving the independent set problem on the graph.
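As an illustrative sketch (the scheme representation is an assumption), the conflict graph can be assembled by testing conditions (1) and (2) for every pair of feasible schemes:

```python
# Hypothetical sketch: build the conflict graph CG from feasible schemes.
# Each scheme is (task_id, alpha, windows), where windows is a list of
# (satellite, equipment, start, end) tuples.

def windows_conflict(w1, w2):
    i1, j1, s1, e1 = w1
    i2, j2, s2, e2 = w2
    return (s1 < e2 and s2 < e1) and (i1 == i2 or j1 == j2)

def schemes_conflict(sch1, sch2):
    """Schemes conflict if they share a window or any pair of their windows conflicts."""
    _, _, ws1 = sch1
    _, _, ws2 = sch2
    shared = set(ws1) & set(ws2)
    return bool(shared) or any(windows_conflict(a, b) for a in ws1 for b in ws2)

def build_conflict_graph(schemes):
    """Vertices are scheme indices; edges follow conditions (1) and (2) above."""
    edges = {idx: set() for idx in range(len(schemes))}
    for a in range(len(schemes)):
        for b in range(a + 1, len(schemes)):
            same_task = schemes[a][0] == schemes[b][0]                    # condition (1)
            if same_task or schemes_conflict(schemes[a], schemes[b]):     # condition (2)
                edges[a].add(b)
                edges[b].add(a)
    return edges
```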
5. Low Time Fragmentation-Oriented TT&C Task-Planning Algorithm
5.1. Graph Learning-Based Framework
Section 4 transformed the TT&C task-planning problem into the problem of finding an independent set in the task-planning conflict graph. Based on this, this section combines deep learning and graph theory to propose a problem-solving framework, as shown in Figure 4. In this framework, the set IS represents the local solution of the independent set problem, which is a subset of the vertex set V, and the Q evaluation function is obtained through reinforcement learning training.
As shown in Figure 4, we first construct the set of feasible schemes. On this basis, we construct the task-planning conflict graph and calculate the Q values of all vertices in the graph. Then, according to the Q evaluation function trained by deep learning, we select the vertex with the largest Q value and add it to the current local solution IS. Afterward, we delete the selected vertex from the task-planning conflict graph, and the conflict graph is updated. Lastly, we repeat the above process until no vertices remain in the task-planning conflict graph. The final local solution IS is then taken as the resulting independent set.
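The framework of Figure 4 amounts to a greedy loop guided by the learned Q values. A schematic sketch is given below; the callable q_value stands in for the trained evaluation function and is an assumption of the example:

```python
# Hypothetical sketch of the graph-learning framework of Section 5.1.
# adj maps each vertex to its set of neighbors in the conflict graph;
# q_value(IS, v) is the trained Q evaluation function (assumed available).

def greedy_independent_set(adj, q_value):
    remaining = set(adj)          # vertices still present in the conflict graph
    IS = []                       # current local solution
    while remaining:
        # pick the remaining vertex with the largest Q value
        v = max(remaining, key=lambda u: q_value(IS, u))
        IS.append(v)
        # deleting v and its neighbors (as in Algorithm 2) keeps IS an independent set
        remaining -= {v} | adj[v]
    return IS
```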
5.2. Feasible Scheme Set Construction
A TT&C task may have many feasible schemes, which increases the complexity of TT&C task planning in a space network. To better solve the TT&C task-planning problem of space networks, this paper constructs the feasible scheme set of all tasks in advance. Specifically, we model the feasible planning orders of the visible windows of the same task as a directed graph. Then, the problem of constructing the set of feasible schemes is transformed into a path-search problem in this directed graph. Finally, we use a traversal search algorithm to find all the feasible schemes of all TT&C tasks.
Let the directed graph $G_i(V_i, E_i)$ represent all visible windows of satellite $s_i$ as well as their relationships within a planning horizon. The vertex set $V_i$ corresponds to all visible windows between satellite $s_i$ and the TT&C equipment within the planning horizon, together with a virtual start vertex B and a virtual end vertex E that act as two virtual visible windows of the TT&C task $t_i$. A directed edge of $G_i$ represents the sequential scheduling relationship between two visible windows for task $t_i$. For instance, let $w_a$ and $w_b$ be two visible windows for task $t_i$; if there is no overlap between the two windows and the ascending/descending orbit alternation condition is satisfied, the two windows can be scheduled sequentially for task $t_i$. That is to say, the edge set $E_i$ of the graph represents the connection relationships between vertices, which is defined as follows:
- (1) For every window vertex $w \in V_i \setminus \{B, E\}$, the edges $(B, w)$ and $(w, E)$ exist in $E_i$;
- (2) For any two window vertices $w_a, w_b \in V_i$, if the two windows do not overlap and one is an orbit ascending window while the other is an orbit descending window (without loss of generality, we assume that $w_a$ ends before $w_b$ starts), then $(w_a, w_b) \in E_i$.
As shown in Figure 5, vertex B and vertex E represent the virtual start window and virtual end window, respectively, and the remaining vertices represent different visible windows. A path from B to E that satisfies the required number of windows and the ascending or descending orbit requirements represents a feasible scheme within the planning horizon. Since the aim of the TT&C tasks of a space network is to track, telemeter, and command satellites, different tasks are independent of each other in the construction of feasible schemes. That is to say, the directed graphs of the satellites are independent of each other, and the problem of constructing the feasible schemes is transformed into a path-search problem on each independent directed graph. This paper uses a traversal search algorithm on the graph to complete the construction of the feasible scheme set [19,20,21].
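To illustrate the construction, the sketch below enumerates candidate schemes for one task by a depth-first traversal over windows that do not overlap and that alternate between ascending and descending orbits; the window fields and the simplified requirement (a fixed number of windows) are assumptions of the example:

```python
# Hypothetical sketch: enumerate feasible schemes for one task by DFS.
# A window is (equipment, start, end, direction) with direction 'A' (ascending) or 'D' (descending).

def feasible_schemes(windows, n_required):
    """Return all ordered window combinations of length n_required that do not
    overlap in time and alternate between ascending and descending orbits."""
    windows = sorted(windows, key=lambda w: w[1])            # sort by start time
    schemes = []

    def extend(path):
        if len(path) == n_required:
            schemes.append(list(path))
            return
        for w in windows:
            if not path:
                extend([w])                                   # start a new path at w
                continue
            last = path[-1]
            if w[1] >= last[2] and w[3] != last[3]:           # no overlap, alternate A/D
                extend(path + [w])

    extend([])
    return schemes

wins = [(1, 0, 10, 'A'), (2, 15, 25, 'D'), (1, 30, 40, 'A'), (3, 45, 55, 'D')]
print(len(feasible_schemes(wins, 2)))   # number of alternating, non-overlapping pairs
```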
5.3. Vertex Attribute Extraction and Parametric Design of Q Function
Define the vertex attribute $x_v = \big(d(v), n(v), \Delta f(v), r(v)\big)$ of the task-planning conflict graph, where $d(v)$ is the degree of vertex $v$ in the current task-planning conflict graph $CG$, and $n(v)$ is the number of vertices in $CG$ that correspond to the same task as vertex $v$, namely, $n(v_i^{\alpha}) = \big|\{ v_i^{\beta} \in V \mid \beta \neq \alpha \}\big|$, where the operator $|\cdot|$ denotes the number of elements in a set. $\Delta f(v)$ represents the newly added time fragments after the feasible scheme represented by vertex $v$ is added to the current local solution, and $r(v)$ is the revenue obtained after the task feasible scheme corresponding to vertex $v$ is completed.
To simplify the symbolic expression, $v$ is used in the following text to denote a vertex $v_i^{\alpha}$ of the task-planning conflict graph. The graph structure embedding is carried out recursively. In each step of the recursive process, information is embedded for each vertex of the graph so that it contains a larger range of graph structure and vertex attribute information. By embedding different amounts of information in each vertex, the computation can be effectively reduced and the algorithm efficiency can be improved. In particular, in the recursive process, a $q$-dimensional vector $\mu_v^{(k)}$ is used to represent the embedding vector of vertex $v$ obtained after the $k$-th iteration. The initial value $\mu_v^{(0)}$ of the embedding vector of vertex $v$ is set to 0.
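The display form of Equation (4) does not survive in this text. A plausible form, following the structure2vec-style embedding that the surrounding description suggests (the exact parameterization is an assumption), is:

```latex
\mu_v^{(k)} \;=\; \mathrm{relu}\!\Big(\theta_1\, x_v \;+\; \theta_2 \sum_{u \in N(v)} \mu_u^{(k-1)}\Big),
\qquad k = 1, \dots, K.
```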
It can be seen from Equation (4) that the updating process of the graph structure embedding is based on the adjacency relationships of the vertices in the graph, where $N(v)$ denotes the set of vertices adjacent to vertex $v$ in the graph, $\theta_1$ and $\theta_2$ are the model parameters, and relu(·) is the rectified linear unit. Based on Equation (4), one can see that the embedding update process is carried out according to the graph topology: a new round of embedding sweeping across the vertices starts only after the embedding update for all vertices from the previous round has finished. The above process goes through a total of K iterations, and each vertex obtains an embedding vector $\mu_v^{(K)}$.
After the $K$ steps of graph structure embedding are completed, the embedding vectors are used to define the evaluation function $Q\big(h(CG), v; \Theta\big)$, where $h(CG) = \sum_{u \in V} \mu_u^{(K)}$ and $\mu_v^{(K)}$ are used to represent the graph $CG$ and the vertex $v$, respectively. $Q\big(h(CG), v; \Theta\big)$ is defined as follows:
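(The display equation is likewise missing here; a plausible form of Equation (5), consistent with the three additional parameters and the concatenation operator described next, is shown below as an assumption.)

```latex
Q\big(h(CG), v; \Theta\big) \;=\; \theta_5^{\top}\,
\mathrm{relu}\!\Big(\big[\,\theta_3 \textstyle\sum_{u \in V}\mu_u^{(K)},\;\; \theta_4\, \mu_v^{(K)}\,\big]\Big),
```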
where $\theta_3$, $\theta_4$, and $\theta_5$ are the parameters to be trained for the model, and $[\cdot,\cdot]$ is the concatenation operator. The vectors $h(CG)$ and $\mu_v^{(K)}$ are calculated by Equation (4), so they are affected by the parameters $\theta_1$ and $\theta_2$ used in the embedding of the graph structure. Therefore, the evaluation function $Q\big(h(CG), v; \Theta\big)$ is determined by the five parameters $\Theta = \{\theta_1, \theta_2, \theta_3, \theta_4, \theta_5\}$.
Algorithm 1 shows the specific calculation steps of the evaluation function $Q\big(h(CG), v; \Theta\big)$, where the embedding vectors need to be recalculated in each iteration according to the current subgraph of the task-planning conflict graph. The parameters $\Theta$ are obtained through deep Q-learning training.
Algorithm 1 Calculation of the evaluation function $Q\big(h(CG), v; \Theta\big)$
1: Initialize the vertex attributes $x_v$ and set $\mu_v^{(0)} = 0$ for all $v \in V$
2: for k = 1 to K do
3:   for v ∈ V do
4:     Update the embedding vector $\mu_v^{(k)}$ according to Equation (4)
5:   end for
6: end for
7: Return $Q\big(h(CG), v; \Theta\big)$ computed from the embeddings $\mu_v^{(K)}$ by Equation (5)
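A compact NumPy sketch of Algorithm 1 under the assumed forms of Equations (4) and (5) above is given below; the parameter shapes and the attribute layout are assumptions of the example:

```python
import numpy as np

def evaluate_q(adj, x, theta, K=4):
    """Compute Q(h(CG), v; Theta) for every vertex v (sketch of Algorithm 1).
    adj: dict vertex -> set of neighbors; x: dict vertex -> 4-dimensional attribute vector.
    theta: (t1, t2, t3, t4, t5) with shapes (q,4), (q,q), (q,q), (q,q), (2q,)."""
    t1, t2, t3, t4, t5 = theta
    q = t1.shape[0]
    mu = {v: np.zeros(q) for v in adj}               # mu_v^(0) = 0
    for _ in range(K):                               # K rounds of embedding updates
        new_mu = {}
        for v in adj:
            neigh = sum((mu[u] for u in adj[v]), np.zeros(q))
            new_mu[v] = np.maximum(0.0, t1 @ np.asarray(x[v], dtype=float) + t2 @ neigh)  # Eq. (4)
        mu = new_mu                                  # one sweep finishes before the next round
    h = sum(mu.values())                             # graph-level embedding h(CG)
    return {v: t5 @ np.maximum(0.0, np.concatenate([t3 @ h, t4 @ mu[v]]))                 # Eq. (5)
            for v in adj}
```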
5.4. Evaluation Function Based on DDQN Network Training
As a typical representative of value-based reinforcement learning algorithms, the DQN algorithm combines deep learning and reinforcement learning to predict the Q value. The Q value can be understood as the value of a state–action pair, i.e., the expected revenue the agent obtains by taking a certain action in a certain state. Although the DQN algorithm has achieved good results in deep reinforcement learning, it sometimes overestimates the action values too optimistically. This can cause the algorithm to mistake a suboptimal action value for the optimal one and ultimately fail to converge to the best value function. DDQN differs from DQN in how the target value is calculated: DQN directly selects the largest Q value given by the target network as the target value, whereas DDQN selects the action with the current (evaluation) Q network and then evaluates that action with the target network. Therefore, the target Q value in DDQN is less than or equal to that in DQN, which reduces the overestimation to a certain extent and makes the Q value closer to the real value.
Therefore, in this section, finding the independent set in the task-planning conflict graph is modeled as a DDQN learning process, and the Q evaluation function is trained within this process. This paper defines the state, action, and reward under the reinforcement learning framework as follows:
State: The state in reinforcement learning is the local solution IS. Because the influence of the local solution on the subsequent independent set construction is fully captured by the task-planning conflict subgraph remaining at the current iteration, the $q$-dimensional embedding vector of this subgraph is used to represent the current state.
Action: The action $v$ is a vertex that does not belong to the current local solution IS, and it is expressed by its $q$-dimensional embedding vector $\mu_v^{(K)}$.
Reward: This paper aims to minimize time fragments and maximize task revenue. Therefore, the designed reward is the revenue of vertex $v$ minus the increment of the total time fragments after vertex $v$ is added to the current local solution IS, namely:
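(The corresponding display equation is missing from this text; a plausible form, assuming that the constant $\lambda$ weights the fragment term, is:)

```latex
R(IS_\tau, v_\tau) \;=\; r(v_\tau) \;-\; \lambda\, \Delta F(IS_\tau, v_\tau),
```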
where $\Delta F(IS, v)$ is the increment of the total time fragments generated by adding the feasible scheme represented by vertex $v$ to the set of feasible schemes corresponding to the current solution IS, and $\lambda \in (0, 1)$ is a constant.
The evaluation function $Q$ gives the value of action $v$ in a given state IS, and the DDQN is used to optimize the parameters $\Theta$ of the evaluation function during the reinforcement learning process.
In this paper, two neural networks are constructed to train the model parameters, i.e., a target network and an evaluation network. The evaluation network is used to select an action, and the target network is used to evaluate the value of that action. The two networks update their parameters alternately. Specifically, the target network has parameters $\Theta^-$ and the evaluation network has parameters $\Theta$. The parameters $\Theta$ are updated by minimizing the squared loss through the gradient descent method:
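(The display form of the loss, Equation (7), is not reproduced here; a plausible squared-error form, stated as an assumption, is:)

```latex
L(\Theta) \;=\; \Big(y_\tau \;-\; Q\big(h(IS_\tau), v_\tau; \Theta\big)\Big)^2,
```

where $y_\tau$ is the target value defined below.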
During the training process, the parameters $\Theta$ are copied to $\Theta^-$ every several steps. In DDQN, the target value is defined as:
where $\gamma$ is the discount factor of Q-learning.
When constructing an independent set, the reward obtained for each additional vertex comes not only from the successful planning of the task represented by that vertex but also from the subsequent rewards of non-conflicting tasks planned afterward. For this delayed-reward scenario, a Q function updated at every single step is too short-sighted to reflect the long-term revenue well. Therefore, this paper uses the n-step Q-learning method to update the parameters. Unlike single-step learning, n-step Q-learning waits n steps before updating the parameters, so that the Q function can provide a more precise evaluation of future rewards. The parameter update method is the same as in single-step learning (Equation (7)); the difference lies in the calculation of the target value:
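(A plausible n-step version of the target, consistent with the description above and stated as an assumption, is:)

```latex
y_\tau \;=\; \sum_{i=0}^{n-1} R(IS_{\tau+i}, v_{\tau+i}) \;+\;
\gamma\, Q\Big(h(IS_{\tau+n}),\ \operatorname*{arg\,max}_{v'}\, Q\big(h(IS_{\tau+n}), v'; \Theta\big);\ \Theta^{-}\Big).
```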
Moreover, there is a strong correlation between the consecutive actions of the independent set construction process. In order to reduce the impact of the correlation between consecutive samples on neural network training, this paper uses the experience replay technique to speed up the convergence of the neural network and improve the robustness of the training results. Specifically, samples are continuously added to the experience data set at each step of every round. For example, in step $\tau$, a tuple consisting of the state $n$ steps earlier, the action taken at that step, the reward accumulated over the subsequent $n$ steps, and the current state $IS_{\tau}$ is added to the data set.
In summary, there is a reward-delay problem during the independent set construction, and there is a strong correlation between consecutive samples. Based on this, we first address the reward delay using n-step Q-learning and then use the experience replay technique to reduce the correlation between consecutive samples. Finally, this paper designs an independent set algorithm based on double deep Q-learning to complete the training of the parameters $\Theta$, as shown in Algorithm 2, where the task-planning conflict graph of each training round is constructed from a set of tasks drawn from the sample space.
Algorithm 2: Evaluation function training based on DDQN learning
1: Initialize the experience replay storage space $B$
2: for round e = 1 to L do
3:   Extract a task set and construct the corresponding task-planning conflict graph
4:   Initialize $\tau = 1$, $IS_1 = \varnothing$, and random parameters $\Theta$, $\Theta^-$
5:   while the task-planning conflict graph still contains vertices do
6:     Select the vertex $v_{\tau}$ with the largest Q value according to the evaluation network
7:     Add vertex $v_{\tau}$ into the local solution: $IS_{\tau+1} = IS_{\tau} \cup \{v_{\tau}\}$
8:     Delete vertex $v_{\tau}$ and its neighbor vertices from the conflict graph
9:     if $\tau \ge n$ then
10:      Add the n-step experience tuple to $B$
11:      Sample a random batch from $B$
12:      Use the samples in the batch to train and update the value of $\Theta$ based on the gradient descent method
13:      Replace the target parameters $\Theta^-$ with $\Theta$ every c steps
14:    end if
15:    $\tau = \tau + 1$
16:  end while
17: end for
18: return $\Theta$
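For concreteness, the sketch below illustrates the double-Q target computation used in steps 10–12 of Algorithm 2 under the equations assumed above; the sample layout and the callables q_eval (parameters $\Theta$) and q_target (parameters $\Theta^-$) are assumptions of the example:

```python
# Hypothetical sketch: n-step double-DQN target computation for a sampled batch.
# Each sample is (state, action, n_step_reward, next_state, next_actions), where
# q_eval uses the evaluation parameters and q_target uses the target parameters.

def ddqn_targets(batch, q_eval, q_target, gamma=0.9):
    targets = []
    for state, action, n_step_reward, next_state, next_actions in batch:
        if not next_actions:                      # terminal: no vertices left to add
            targets.append(n_step_reward)
            continue
        # action selection with the evaluation network ...
        best = max(next_actions, key=lambda v: q_eval(next_state, v))
        # ... but value estimation with the target network (the DDQN idea)
        targets.append(n_step_reward + gamma * q_target(next_state, best))
    return targets
```

The squared differences between these targets and q_eval(state, action) would then be minimized by gradient descent, and the target parameters replaced every c steps, as in steps 12–13.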
6. Simulation
In this paper, STK 11.3 is used to build the space network scenario, TensorFlow 1.15 is used to build the neural network, and Python 3.7 is used to build the main program of the algorithm. The space network scenario is configured as follows:
- (1)
A total of 150 low-orbit satellites are located in 150 different orbits. The orbital altitudes are 500 km, 650 km, and 800 km. Inclinations are evenly distributed within the range of 96 to 100 degrees.
- (2)
There are two ground stations (Xi’an Station and Kashi Station in China), each of which has three sets of TT&C equipment.
The task-planning horizon is from 2022-08-27 04:00:00 to 2022-08-28 04:00:00. During the experiment, each TT&C task requests two to six TT&C windows with orbit ascending or descending requirements. A total of 150 groups of tasks are randomly generated, of which 120 are used as the training set and 30 as the test set. The time fragment length threshold Th is set to 5 min during the experiment.
Figure 6 shows the convergence of the Q-function network training at different learning rates. It can be seen that the larger the learning rate, the faster the Q-function converges. When the learning rate is 0.05, the loss function converges the fastest, but the result quickly falls into a local optimum. When the learning rate is 0.0005, the loss function converges too slowly, so the training time becomes too long to achieve a good training effect. Therefore, this paper uses a learning rate of 0.005 to train the Q evaluation function, which not only prevents the loss function from converging too slowly but also reduces the risk of falling into a local optimum, thereby achieving the best training result.
In order to verify the performance of the proposed algorithm, this paper compares the LTFTT&C algorithm with the greedy randomized adaptive search procedure (GRASP) [22] and the genetic algorithm (GA) [23]. The comparison mainly covers the total time fragments in different TT&C task scenarios, the average response time of emergency tasks, and the total task revenue.
Figure 7 compares the total time fragments of the TT&C equipment produced by the three TT&C task-planning algorithms under different task numbers. It can be seen that, compared with the other two algorithms, the LTFTT&C algorithm significantly reduces the total amount of equipment time fragments under all task numbers: LTFTT&C reduces the total time fragments of the TT&C equipment by 48% compared with the GRASP algorithm and by 32% compared with the genetic algorithm. To further verify the gain that the reduced time fragmentation of LTFTT&C brings to emergency task planning, emergency tasks were inserted into the common-task plans produced by the three algorithms, and the response time (i.e., the difference between the emergency task completion time and its arrival time) was measured. As shown in Figure 8, as the number of emergency tasks increases, the average response time of the emergency tasks increases to varying degrees for all three algorithms, and the LTFTT&C algorithm is clearly better than the other two: its average response time for emergency tasks is 43% shorter than that of the GRASP algorithm and 36% shorter than that of the genetic algorithm. This is mainly because the LTFTT&C algorithm effectively reduces the amount of time fragments, making it easier to insert emergency tasks into suitable visible windows.
Figure 9 shows the performance of the three algorithms in terms of the total task revenue. It can be seen from the figure that when the number of tasks is small, the LTFTT&C algorithm is significantly better than the two comparison algorithms in terms of task revenue: it is 21% and 15% higher than the GRASP and genetic algorithms, respectively. When the number of tasks is large, the LTFTT&C algorithm improves the total task revenue by 28% compared with the GRASP algorithm, while its total task revenue is 7% lower than that of the genetic algorithm. Combining Figure 7 and Figure 9, compared with GRASP, the LTFTT&C algorithm, which adopts the deep reinforcement learning method, not only improves the revenue but also reduces the time fragments. Compared with the genetic algorithm, the LTFTT&C algorithm sacrifices a small amount of revenue in exchange for a significant decrease in time fragmentation, thus effectively reducing the response time of emergency tasks.
7. Conclusions
In this paper, we proposed a low time fragmentation-oriented TT&C task-planning method to reduce the total time fragments of TT&C equipment and shorten the response time of emergency tasks. Firstly, this paper established a mathematical model with the optimization goals of minimizing the total time fragments and maximizing the total revenue of the TT&C tasks. Then, we transformed the TT&C task-planning problem into an independent set problem on a graph and introduced the DDQN from deep reinforcement learning as the evaluation strategy. Afterward, we designed the low time fragmentation-oriented TT&C task-planning algorithm to improve the service capability of the TT&C resources. The simulation results of multiple experimental scenarios showed that the LTFTT&C algorithm is significantly better than the GRASP and GA algorithms in terms of time fragmentation. Therefore, using the LTFTT&C algorithm to solve the TT&C task-planning problem is reasonable and effective.