1. Introduction
The internet of things (IoT) is considered an important emerging technology that can change human life, and it has brought great convenience to human society [1]. IoT applications, such as smart cities [2], smart agriculture [3], maritime detection [4], and augmented reality (AR)/virtual reality (VR) [5], have made great progress and raised new demands for future network development [6]. It is predicted that by 2027, more than 30 billion IoT devices will be deployed around the world [7]. Limited by geographical conditions, traditional terrestrial networks struggle to provide reliable services for IoT devices in maritime areas, remote areas, and other areas without terrestrial network coverage [8]. Therefore, non-terrestrial networks, which are unaffected by geographical limitations, can provide reliable wide-area coverage for terrestrial IoT devices, making them a crucial direction for future network development.
According to the deployment heights of their nodes, non-terrestrial networks can be divided into satellite networks and space-based networks. Satellite network nodes are deployed at altitudes ranging from 160 km to 35,786 km above the ground [9]. In the past 20 years, satellite communication networks, represented by low-earth orbit (LEO) satellite constellations, have experienced tremendous development. Typical LEO satellite constellations include SpaceX's Starlink constellation system [10], the Iridium constellation [11], the OneWeb constellation [12], and the Telesat constellation [13]. By deploying edge servers, satellite network nodes can directly provide task processing services to terrestrial IoT devices. However, due to the high flight altitude of satellites and the limited energy and signal transmission power of the large number of terrestrial IoT devices, these devices face challenges in sending data to satellites stably and quickly.
Different from satellite networks, space-based networks, represented by high altitude platform (HAP) drones, are deployed at altitudes of several hundred meters to 20 km above the ground [14]. Their flying heights are much lower than those of satellite networks, so ground IoT devices can transmit data to aerial nodes more stably and at higher speeds. According to the definition of the International Telecommunication Union (ITU), a HAP drone is an unmanned aerial vehicle (UAV) deployed at an altitude of 20 km above the ground [15,16]. It can either hover to provide stable communication services for ground equipment or be deployed in a mobile manner to respond to emergency service requirements. Since the distance from a ground IoT device to a HAP drone is much smaller than the distance from the device to a satellite, the wireless channel to the HAP drone is more stable and has a higher channel gain; therefore, the ground IoT device can communicate with the HAP drone at a higher rate. Similarly, HAP drones can carry edge servers to provide edge computing services to IoT devices.
However, since the load capacity of HAP drones and LEO satellites is limited, a single HAP drone or satellite cannot guarantee stable and effective task processing services for the terrestrial IoT devices it serves. Therefore, sharing server computing resources through the collaboration of multiple non-terrestrial nodes has become a promising way to improve the service capabilities of non-terrestrial networks. By dividing a task into multiple sub-parts and processing them on servers in different locations, the network's task processing capability can be effectively improved. However, for HAP drone and satellite collaborative networks, how to reasonably split tasks and how to optimize the allocation of communication and computing resources are important open issues.
Inspired by these challenges, we investigate the problem of task splitting and resource allocation in a collaborative network of HAP drones and LEO satellites, as shown in Figure 1. In this paper, we propose that HAP drones can cooperate not only with satellites but also with each other to jointly provide computing services for ground IoT devices. This design expands the task processing capabilities of the double-layer network and improves the resource utilization of the network. Correspondingly, we consider factors such as the task processing tolerance delay of ground IoT devices, the limited computing resources of satellites, HAP drones, and IoT devices, and the limited communication resources between devices, and we formulate the problem of minimizing the total task processing delay.
In order to minimize the total task processing delay, we propose an intelligent online offloading and resource allocation algorithm that combines deep reinforcement learning and convex optimization. We first decompose the original optimization problem into a task splitting sub-problem and a resource allocation sub-problem. For the task splitting sub-problem, we design an intelligent solving algorithm based on the deep deterministic policy gradient (DDPG) method; for the resource allocation sub-problem, we use a convex optimization algorithm. Through the joint optimization of the two sub-problems, the efficiency of task offloading and resource allocation can be improved and the total task processing delay reduced. Unlike typical reinforcement learning solutions, the proposed algorithm does not use deep reinforcement learning to directly output the task offloading and resource-scheduling strategies; this reduces the scale of the neural network, which in turn reduces the training difficulty and improves the convergence speed of the network. Finally, we conduct simulation experiments to verify the convergence and performance of the proposed algorithm. The results demonstrate that, under various scenarios, such as different task data sizes and different CPU cycles required per bit, the proposed algorithm reduces the total task processing delay compared to baseline algorithms.
The main contributions of this paper are summarized as follows:
- (1) We construct an edge computing framework for multi-HAP-drone and multi-LEO collaboration. Under this framework, ground IoT devices' tasks can be dynamically allocated to multiple HAP drones and LEO satellites for processing, and the communication and computing resources of each node can also be dynamically allocated.
- (2) Considering the task splitting constraints, the available resource constraints, and the maximum tolerated delay constraint of the tasks, we construct a task splitting and resource allocation problem to minimize the total system delay. This is a non-convex continuous optimization problem.
- (3) We propose a joint optimization algorithm combining deep reinforcement learning and convex optimization. We design a task splitting optimization algorithm based on the DDPG method and solve the optimal resource allocation strategy through a convex optimization algorithm. We also design the structure of the actor network to ensure that the actions output by the DRL agent are valid.
- (4) We verify the convergence and effectiveness of the proposed algorithm through experiments. By comparing the algorithm's convergence under different discount factors and learning rates, we select reasonable neural network parameters. By comparing our algorithm with three baseline schemes, we verify that it can effectively reduce the total system delay.
The rest of the paper is structured as follows. Related works are presented in Section 2. The system model of the HAP drone and LEO satellite collaborative network is introduced in Section 3. In Section 4, we formulate the problem. We introduce the proposed task offloading and resource allocation algorithm in detail in Section 5. Numerical results are presented to verify the convergence and performance of the proposed algorithm in Section 6. Finally, we conclude the paper in Section 7.
2. Related Works
Recently, there have been many studies on edge computing for satellites. In [17], the authors studied the edge cloud resource scheduling problem of space–air–ground integrated networks (SAGIN) and proposed an improved Two_Arch2 algorithm to optimize the resource-scheduling strategy of the SAGIN so as to improve the service capabilities of the internet of vehicles. In [18], the authors proposed a satellite edge computing network architecture that supports the IoT and constructed a multi-objective optimization problem that considers system delay, computing power, and energy consumption; they proposed a slicing-based scheduling strategy to optimize the offloading sequence and the number of offloaded tasks. In [19], the authors studied a hybrid LEO/MEO satellite network for the IoT; to solve the problem of satellite load imbalance, they formulated a joint optimization problem of computing and communication resources and proposed a deep reinforcement learning algorithm to solve it. In [20], the authors studied the service chain optimization problem in satellite edge computing scenarios; aiming to minimize transmission delay, they designed two algorithms, an approximation algorithm and an online algorithm.
In addition, there are also studies on HAP drones carrying edge servers to provide edge computing services. Qiqi Ren et al. [21] studied HAP drone and ground network collaboration to provide computation offloading services for ground transportation; considering the joint optimization of caching, computing, and communication resources, the authors used multi-agent reinforcement learning and the Lagrange multiplier method to solve the problem, which effectively reduced task processing delay. In [22], the authors introduced NOMA technology and deployed edge servers on HAPs to provide computation offloading services for ground users; taking power, transmission bandwidth, and maximum tolerated delay constraints into account, they proposed a joint transmission and deployment optimization algorithm based on successive convex approximation to minimize system energy consumption. In [23], the authors also studied a HAP network that supports NOMA and proposed a power control algorithm based on DDPG to reduce energy consumption and task processing delay. In [24], the authors considered edge computing and wireless power transfer at the same time: in addition to offloading tasks to HAP drones for processing, ground IoT devices can also charge themselves through ground access points. The authors considered the problem of maximizing computing power while minimizing IoT energy consumption and designed a heuristic algorithm to solve it.
Research on edge computing for collaboration between HAP drones and satellites mainly focuses on resource scheduling between the two layers. In [25], the authors studied the task offloading and resource allocation problem in a scenario where a HAP carries an edge server and cooperates with the LEO satellite network; the HAP drone can directly provide computing services to ground vehicles or forward data through LEO satellites to the ground center for processing. In [26], the authors studied the application prospects of machine learning for resource scheduling in space–air–ground integrated networks and designed an optimization algorithm based on a deep neural network to realize intelligent user scheduling. In [27], the authors studied an edge computing network where HAPs and satellites collaborate; under this framework, users' tasks can be processed collaboratively by HAPs and satellites, and the authors proposed a task offloading and resource allocation algorithm based on block coordinate descent to improve network service capabilities.
In current research on edge computing in HAP drone and satellite collaborative networks, HAP drones directly collaborate with satellites to provide computing services for ground users [25,26,27,28]. In this paper, HAP drones can collaborate not only with satellites but also with each other to jointly provide computing services to ground users. This method expands the computing resources available to each user and is beneficial in reducing task processing delay.
In addition, many current works, such as [23,25,29,30], directly design reinforcement learning algorithms to output task offloading and resource allocation strategies. In this way, the output of the neural network (including both the task offloading strategy and the resource allocation strategy) is large, resulting in a large network scale and a long training time. In this paper, we propose an optimization framework that combines deep reinforcement learning with convex optimization: reinforcement learning only outputs the task splitting strategy, and resource allocation is optimized by a convex optimization method, which reduces the output size of the neural network and the difficulty of network training.
3. System Model of HAP Drones and LEO Satellite Collaborative Networks
In this section, we consider the HAP drone and LEO satellite collaborative network shown in Figure 1. In this network, there are M LEO satellites flying at an altitude of 200 km [10]. They carry edge servers to provide computing services to terrestrial IoT devices. In the stratosphere, 20 km above the ground, N HAP drones are hovering. They also carry edge servers, so they can also provide computing services to terrestrial IoT devices.
We denote the set of LEO satellites as $\mathcal{M} = \{1, 2, \ldots, M\}$. For LEO satellite $m \in \mathcal{M}$, its computing capability is denoted as $F_m^{\mathrm{L}}$. The set of HAP drones is labeled $\mathcal{N} = \{1, 2, \ldots, N\}$, and the maximum computing capacity of HAP drone $n \in \mathcal{N}$ is denoted as $F_n^{\mathrm{H}}$. There are $J$ terrestrial IoT devices directly connected to different HAP drones that can offload their tasks to those HAP drones. In each time slot, the input task size of IoT device $j$ is modeled as $D_j$, its computing density (CPU cycles required per bit) is labeled $c_j$, and the maximum tolerable delay of the task is $T_j^{\max}$.
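For concreteness, the following minimal Python sketch collects the system parameters above in one place. The numeric defaults are illustrative assumptions for experimentation, not the values used in our simulations.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Scenario:
    """Illustrative parameters of the HAP-LEO collaborative network."""
    M: int = 2            # number of LEO satellites
    N: int = 3            # number of HAP drones
    J: int = 6            # number of terrestrial IoT devices
    F_leo: float = 10e9   # LEO server capacity F_m^L (CPU cycles/s), assumed
    F_hap: float = 5e9    # HAP server capacity F_n^H (CPU cycles/s), assumed
    rng: np.random.Generator = field(default_factory=np.random.default_rng)

    def sample_tasks(self):
        D = self.rng.uniform(1e6, 5e6, self.J)   # task sizes D_j in bits
        c = self.rng.uniform(500, 1500, self.J)  # computing density c_j (cycles/bit)
        T_max = np.full(self.J, 1.0)             # tolerable delays T_j^max (s)
        return D, c, T_max
```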
3.1. Communication Model
In this system, there are two channels to consider: the IoT–HAP drone channel and the HAP drone–LEO channel.
3.1.1. IoT–HAP Drone Communication Model
The channel gain between terrestrial IoT device $j$ and HAP drone $n$ can be modeled as [28]
$$h_{j,n} = g_0 \left( \frac{c}{4 \pi f_c d_{j,n}} \right)^2 \left| \rho_{j,n} \right|^2,$$
where $c$ is the speed of light, which is equal to $3 \times 10^8$ m/s; $d_{j,n}$ is the distance between terrestrial IoT device $j$ and HAP drone $n$; $f_c$ is the carrier frequency of the transmitted signal; and $g_0$ is the attenuation gain, which is related to the environment. In this paper, we set the HAP drone's antenna gain to 17 dBi [31]. $\rho_{j,n}$ is the small-scale fading component, which follows a Rician distribution with a Rician factor of 10 dB [31].
For HAP drone $n$, the data rate from IoT device $j$ can be expressed as follows [32]:
$$r_{j,n} = B_{j,n} \log_2 \left( 1 + \frac{p_j h_{j,n}}{N_0 B_{j,n}} \right),$$
where $p_j$ is the transmit power of IoT device $j$, $N_0$ is the spectral density of the additive white Gaussian noise (AWGN), and $B_{j,n}$ is the wireless channel bandwidth between terrestrial IoT device $j$ and HAP drone $n$.
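As a sanity check on the two formulas above, the sketch below draws one Rician fading sample, evaluates the channel gain, and computes the resulting rate. The specific parameter values (2 GHz carrier, 20 MHz bandwidth, 20 km link, -174 dBm/Hz noise density) are assumptions for illustration only.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def rician_fading(k_factor_db: float, rng) -> complex:
    """Draw one small-scale fading sample from a Rician distribution."""
    k = 10 ** (k_factor_db / 10)
    los = np.sqrt(k / (k + 1))                  # deterministic LoS component
    nlos = np.sqrt(1 / (2 * (k + 1))) * (rng.standard_normal()
                                         + 1j * rng.standard_normal())
    return los + nlos

def channel_gain(d: float, f_c: float, g0_db: float, rho: complex) -> float:
    """h = g0 * (c / (4*pi*f_c*d))^2 * |rho|^2."""
    g0 = 10 ** (g0_db / 10)
    return g0 * (C / (4 * np.pi * f_c * d)) ** 2 * abs(rho) ** 2

def shannon_rate(bandwidth: float, p_tx: float, h: float, n0: float) -> float:
    """r = B * log2(1 + p*h / (N0*B)), in bit/s."""
    return bandwidth * np.log2(1 + p_tx * h / (n0 * bandwidth))

rng = np.random.default_rng(0)
rho = rician_fading(10.0, rng)                           # Rician factor 10 dB
h = channel_gain(d=20e3, f_c=2e9, g0_db=17.0, rho=rho)   # 17 dBi gain
n0 = 10 ** (-174 / 10) * 1e-3                            # -174 dBm/Hz in W/Hz
r = shannon_rate(20e6, p_tx=0.2, h=h, n0=n0)
print(f"rate = {r / 1e6:.2f} Mbit/s")
```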
3.1.2. HAP Drone–LEO Satellite Communication Model
Because both HAP drones and LEO satellites are located at altitudes of at least 20 km above the ground, we consider the wireless channel between HAP drones and LEO satellites to be line-of-sight (LoS) channels, which comply with the free-space path loss model. According to 3GPP TR 38.811 [
33], we can model this channel as follows:
where
is the distance between the HAP drone
n and LEO satellite
m.
represents the carrier frequency of the wireless signal transmitted from the HAP drone to the LEO satellite, and we set it as
31 GHz.
The corresponding data rate between HAP drone $n$ and LEO satellite $m$ can be formulated as follows [34]:
$$r_{n,m} = B_{n,m} \log_2 \left( 1 + \frac{p_n G_t G_r h_{n,m}}{L_a N_0 B_{n,m}} \right),$$
where $p_n$ is the transmit power of HAP drone $n$, $G_t$ is the transmit antenna gain of the HAP drone, $G_r$ is the receive antenna gain of the LEO satellite, $L_a$ represents the additional path loss caused by environmental and atmospheric effects, and $B_{n,m}$ is the wireless channel bandwidth between HAP drone $n$ and LEO satellite $m$.
3.2. Computing Model
Each user's task can be divided into four parts: one part can be processed locally, while the remaining parts can be processed on the directly connected HAP drone's edge server, another HAP drone's edge server, or a LEO satellite's edge server. Below, we introduce each part in detail.
Local processing: Terrestrial users can put part of the task on the local CPU for processing. We denote $\alpha_j^0$ as the proportion of the task processed locally. Therefore, the data size of the task processed locally is $\alpha_j^0 D_j$, and the required number of CPU cycles can be expressed as $\alpha_j^0 D_j c_j$.
Processing on directly connected HAP drones: In addition to the data processed locally, the user transmits the remaining part to the HAP drone directly connected to it. The amount of data in this part is $(1 - \alpha_j^0) D_j$. After the HAP drone receives this data, it directly puts part of it on its own server for processing. We express the proportion of this part as $\alpha_j^1$. Therefore, the data size and the required number of CPU cycles can be denoted as $\alpha_j^1 D_j$ and $\alpha_j^1 D_j c_j$.
Processing on forwarded HAP drones: The HAP drone can also offload part of the task to another HAP drone connected to it and process it on that drone's server. The proportion of this part of the task is $\alpha_j^2$, and the amount of data and the required number of CPU cycles are $\alpha_j^2 D_j$ and $\alpha_j^2 D_j c_j$.
Processing on LEO satellites: The HAP drone is connected to a LEO satellite, so the remaining part of the task can be forwarded to the LEO satellite and processed on its server. The proportion of this part of the task is $\alpha_j^3 = 1 - \alpha_j^0 - \alpha_j^1 - \alpha_j^2$.
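Continuing the Python sketches, the helper below makes the bookkeeping of this four-way split explicit; the function and variable names are ours, not the paper's.

```python
def split_task(D_j: float, c_j: float, alpha: np.ndarray):
    """Split task j into (local, direct HAP, forwarded HAP, LEO) parts.

    alpha is a length-4 vector of proportions summing to 1.
    Returns per-part data sizes (bits) and CPU-cycle demands.
    """
    assert np.isclose(alpha.sum(), 1.0) and (alpha >= 0).all()
    data = alpha * D_j      # bits processed at each location
    cycles = data * c_j     # CPU cycles needed at each location
    return data, cycles

data, cycles = split_task(D_j=2e6, c_j=1000.0,
                          alpha=np.array([0.2, 0.3, 0.3, 0.2]))
```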
3.3. Overall Delay Analysis
3.3.1. Local Processing
For terrestrial IoT device $j$, part of the task is processed on the local CPU. Assuming that the processing frequency of IoT device $j$ is $f_j$, the local processing delay can be written as follows:
$$T_j^{\mathrm{loc}} = \frac{\alpha_j^0 D_j c_j}{f_j}.$$
3.3.2. Processing on Directly Connected HAP Drones
In addition to the part processed locally, terrestrial IoT devices offload the other parts to the HAP drone directly connected to them; the task is then processed by that serving HAP drone, another HAP drone, or a LEO satellite connected to it. The amount of data sent by terrestrial IoT device $j$ is $(1 - \alpha_j^0) D_j$; therefore, the transmission delay can be expressed as follows:
$$T_j^{\mathrm{tr}} = \sum_{n \in \mathcal{N}} a_{j,n} \frac{(1 - \alpha_j^0) D_j}{r_{j,n}},$$
where $a_{j,n}$ represents the connection relationship between IoT device $j$ and HAP drone $n$: if IoT device $j$ is connected to HAP drone $n$, $a_{j,n} = 1$; otherwise, $a_{j,n} = 0$.

The processing delay of IoT device $j$'s task on the directly connected HAP drone is as follows:
$$T_j^{\mathrm{H,c}} = \sum_{n \in \mathcal{N}} a_{j,n} \frac{\alpha_j^1 D_j c_j}{f_{j,n}},$$
where $f_{j,n}$ is the computing frequency assigned by HAP drone $n$ to IoT device $j$. Therefore, the total processing delay of IoT device $j$'s task on the directly connected HAP drone can be expressed as follows:
$$T_j^{\mathrm{H}} = T_j^{\mathrm{tr}} + T_j^{\mathrm{H,c}}.$$
3.3.3. Processing on Forwarded HAP Drones
After receiving the task from a terrestrial IoT device, the HAP drone can also forward part of it to other HAP drones connected to it. We assume that HAP drones are connected through laser links [35] with very large transmission bandwidth; therefore, we ignore the delay of data forwarding between HAP drones. The overall delay of task processing on forwarded HAP drones can be formulated as follows:
$$T_j^{\mathrm{H'}} = T_j^{\mathrm{tr}} + \sum_{n \in \mathcal{N}} \sum_{n' \in \mathcal{N} \setminus \{n\}} a_{j,n} b_{n,n'} \frac{\alpha_j^2 D_j c_j}{f_{j,n'}},$$
where $b_{n,n'}$ denotes whether HAP drone $n$ is connected to HAP drone $n'$: if they are connected, $b_{n,n'} = 1$; otherwise, $b_{n,n'} = 0$. $f_{j,n'}$ is the computing frequency assigned by HAP drone $n'$ to IoT device $j$.
3.3.4. Processing on LEO Satellites
In addition, HAP drones can also forward tasks to connected LEO satellites. After a task is forwarded to a LEO satellite, it is processed on the server carried by that satellite. The overall delay of the task part processed on the LEO satellite can be formulated as follows:
$$T_j^{\mathrm{LEO}} = T_j^{\mathrm{tr}} + \sum_{n \in \mathcal{N}} \sum_{m \in \mathcal{M}} a_{j,n} e_{n,m} \left( \frac{\alpha_j^3 D_j}{r_{n,m}} + \frac{\alpha_j^3 D_j c_j}{f_{j,m}} \right),$$
where $e_{n,m}$ denotes whether HAP drone $n$ is connected to LEO satellite $m$: if they are connected, $e_{n,m} = 1$; otherwise, $e_{n,m} = 0$. $f_{j,m}$ is the computing frequency assigned by LEO satellite $m$ to terrestrial IoT device $j$.

Since the four parts of a task are processed in parallel, the total delay of task $j$ is determined by the slowest part:
$$T_j = \max \left\{ T_j^{\mathrm{loc}},\; T_j^{\mathrm{H}},\; T_j^{\mathrm{H'}},\; T_j^{\mathrm{LEO}} \right\}.$$
For all IoT devices, the total delay can be calculated as follows:
$$T = \sum_{j=1}^{J} T_j.$$
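The sketch below composes the delay terms above for one device, under the parallel-processing assumption just stated, so the per-device delay is the maximum over the four branches; all inputs are illustrative.

```python
def task_delay(D, c, alpha, f_loc, f_hap_direct, f_hap_fwd, f_leo,
               r_iot_hap, r_hap_leo):
    """Total delay of one device's task under the four-way split (seconds)."""
    t_local = alpha[0] * D * c / f_loc                  # local processing
    t_tx = (1 - alpha[0]) * D / r_iot_hap               # IoT -> HAP uplink
    t_direct = t_tx + alpha[1] * D * c / f_hap_direct   # serving-HAP branch
    t_fwd = t_tx + alpha[2] * D * c / f_hap_fwd         # neighbor-HAP branch
    t_leo = t_tx + alpha[3] * D / r_hap_leo + alpha[3] * D * c / f_leo
    return max(t_local, t_direct, t_fwd, t_leo)

t = task_delay(D=2e6, c=1000.0, alpha=np.array([0.2, 0.3, 0.3, 0.2]),
               f_loc=1e9, f_hap_direct=3e9, f_hap_fwd=3e9, f_leo=5e9,
               r_iot_hap=50e6, r_hap_leo=500e6)
print(f"task delay = {t * 1e3:.1f} ms")
```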
5. Algorithm Design for Problem $\mathcal{P}0$
5.1. Problem Conversion
Firstly, to simplify the problem-solving process, we introduce auxiliary variables $t_j$, representing the overall delay of each terrestrial IoT device's task. Therefore, the original problem $\mathcal{P}0$ (formulated in Section 4) can be rewritten as follows:
$$\mathcal{P}0':\; \min_{\boldsymbol{\alpha},\, \mathbf{f},\, \mathbf{t}} \sum_{j=1}^{J} t_j \quad \mathrm{s.t.}\;\; T_j^{\mathrm{loc}} \le t_j,\; T_j^{\mathrm{H}} \le t_j,\; T_j^{\mathrm{H'}} \le t_j,\; T_j^{\mathrm{LEO}} \le t_j,\; t_j \le T_j^{\max},\; \forall j,$$
together with the task splitting and resource constraints. Since the task splitting variables $\boldsymbol{\alpha}$ and the resource allocation variables $\mathbf{f}$ are coupled, problem $\mathcal{P}0'$ is non-convex and difficult to solve directly. Therefore, to reduce the complexity of the problem, we decompose $\mathcal{P}0'$ into two subproblems: Subproblem 1 determines the task splitting decisions, while Subproblem 2 determines the resource scheduling decisions. Below, we explain the two subproblems, respectively.
Subproblem 1, used to solve the task splitting strategy, can be represented as follows:
$$\mathcal{P}1:\; \min_{\boldsymbol{\alpha}} \sum_{j=1}^{J} t_j \quad \mathrm{s.t.}\;\; \alpha_j^0 + \alpha_j^1 + \alpha_j^2 + \alpha_j^3 = 1,\;\; 0 \le \alpha_j^i \le 1,\; \forall j, i.$$
Subproblem 2 is used to determine the optimal resource scheduling strategy. It is important to note that Subproblem 2 is solved under the assumption that the task splitting strategy is fixed; i.e., Subproblem 2 can be formulated as follows:
$$\mathcal{P}2:\; \min_{\mathbf{f},\, \mathbf{p},\, \mathbf{t}} \sum_{j=1}^{J} t_j \quad \mathrm{s.t.}\;\; \text{the delay, resource, and maximum tolerated delay constraints, with } \boldsymbol{\alpha} \text{ fixed}.$$
Below, we prove that jointly solving $\mathcal{P}1$ and $\mathcal{P}2$ yields the optimal solution of the original problem.
Proof. Assume that two task splitting strategies, $\boldsymbol{\alpha}_1$ and $\boldsymbol{\alpha}_2$, are given. Their corresponding optimal resource allocation strategies, $\mathbf{f}_1^*$ and $\mathbf{f}_2^*$, can be obtained by solving $\mathcal{P}2$. If $T(\boldsymbol{\alpha}_1, \mathbf{f}_1^*) < T(\boldsymbol{\alpha}_2, \mathbf{f}_2^*)$, then $(\boldsymbol{\alpha}_1, \mathbf{f}_1^*)$ is better than $(\boldsymbol{\alpha}_2, \mathbf{f}_2^*)$; that is, $\boldsymbol{\alpha}_1$ is better than $\boldsymbol{\alpha}_2$. By evaluating different $\boldsymbol{\alpha}$, we can find the splitting strategy that makes the total task processing delay smaller. Furthermore, by solving $\mathcal{P}1$ to obtain the optimal $\boldsymbol{\alpha}^*$ and then solving $\mathcal{P}2$ to obtain the corresponding $\mathbf{f}^*$, we obtain the optimal solution to the original problem. This proves that, by jointly solving $\mathcal{P}1$ and $\mathcal{P}2$, we can obtain the optimal solution to the original problem. □
In this paper, the quality of the solution obtained for Subproblem 1 (i.e., the task splitting strategy) depends on solving Subproblem 2. For each solution derived from Subproblem 1, Subproblem 2 must be tackled to determine the optimal resource allocation decision given the task splitting strategy. Then, by iteratively solving Subproblem 1, optimized task offloading and resource allocation decisions are obtained. Therefore, we will first introduce the solution method for Subproblem 2, followed by an explanation of how to address Subproblem 1.
5.2. Algorithm Design for the Optimization of $\mathcal{P}2$
Under a fixed task splitting strategy $\boldsymbol{\alpha}$, the total delay of all tasks depends only on the allocation of communication and computation resources. Since we aim to minimize the total delay of all IoT devices' tasks, we can infer that, under the optimal resource scheduling strategy, each device transmits at its maximum available transmission power and computes its local part at its maximum available computation frequency. Therefore, problem $\mathcal{P}2$ can be rewritten as follows:
$$\mathcal{P}2':\; \min_{\mathbf{f},\, \mathbf{t}} \sum_{j=1}^{J} t_j \quad \mathrm{s.t.}\;\; T_j^{\mathrm{loc}} \le t_j,\; T_j^{\mathrm{H}} \le t_j,\; T_j^{\mathrm{H'}} \le t_j,\; T_j^{\mathrm{LEO}} \le t_j,\; t_j \le T_j^{\max},\; \sum_{j=1}^{J} f_{j,n} \le F_n^{\mathrm{H}},\; \sum_{j=1}^{J} f_{j,m} \le F_m^{\mathrm{L}},$$
where the delay terms are evaluated at $r_{j,n}^{\max}$ and $r_{n,m}^{\max}$, which denote the maximum data transmission rates achieved when the transmit power is maximized for IoT devices and HAP drones, respectively.
Clearly, problem $\mathcal{P}2'$ is a convex optimization problem, and we can directly determine its optimal solution using existing convex optimization solvers. In this paper, since our simulations are implemented in Python, we use CVXPY to solve this problem [36,37].
5.3. Algorithm Design for the Optimization of $\mathcal{P}1$
To address the task splitting subproblem $\mathcal{P}1$, we devise a solution framework based on the DDPG method. We first transform $\mathcal{P}1$ into a Markov decision process (MDP) model and then employ the DDPG-based method to solve the transformed problem.
5.3.1. MDP
An MDP typically consists of states, actions, a reward function, a state transition function, and a discount factor. In this paper, we designate a specific HAP drone as the agent interacting with the environment, which includes the satellites, terrestrial IoT devices, and other HAP drones. The HAP drone collects information such as user data and wireless channel gains to form the state representation and then devises a task splitting strategy (i.e., actions) based on the state. Finally, the decision information is transmitted to the satellites, HAP drones, and terrestrial IoT devices for execution. Below, we provide complete definitions for each concept.
(1) State Space: In this paper, the state is defined by the environment variables of the system, expressed as $s = \{\mathbf{D}, \mathbf{c}, \mathbf{h}, \mathbf{p}^{\max}, \mathbf{f}^{\max}, \mathbf{T}^{\max}\}$, specifically including the following:
- $\mathbf{D} = \{D_1, \ldots, D_J\}$, which represents the data sizes of all terrestrial IoT devices' tasks.
- $\mathbf{c} = \{c_1, \ldots, c_J\}$, which represents the computing densities of all terrestrial IoT devices' tasks.
- $\mathbf{h}$, which represents the gains of the wireless channels between the nodes in the system.
- $\mathbf{p}^{\max} = \{p_1^{\max}, \ldots, p_J^{\max}\}$, which represents the maximum transmit powers of all terrestrial IoT devices.
- $\mathbf{f}^{\max}$, which represents the maximum computing frequencies of the terrestrial IoT devices, HAP drones, and LEO satellites.
- $\mathbf{T}^{\max} = \{T_1^{\max}, \ldots, T_J^{\max}\}$, which represents the maximum tolerated delays of all terrestrial IoT devices' tasks.
(2) Action Space: In this MDP, the agent needs to output the task splitting strategy for each task. Because each task can be divided into four parts, processed locally at the user, at the connected HAP drone, at an adjacent HAP drone, and at the LEO satellite, the action can be expressed as $a = \{\alpha_j^0, \alpha_j^1, \alpha_j^2, \alpha_j^3\}_{j=1}^{J}$ (i.e., a vector of dimension $4J$).
(3) Reward Function: The goal of an MDP is to maximize the reward, while the goal of this paper is to minimize the total task processing delay. Therefore, we take the negative of each task's processing delay and accumulate them as the reward, i.e., $r = -\sum_{j=1}^{J} t_j$.
Therefore, the long-term discounted reward can be formulated as follows:
$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},$$
where $\gamma \in [0, 1)$ is the discount factor, which discounts the value of future rewards when calculating the cumulative reward. If the discount factor is too low, the value of future rewards is severely underestimated, which may lead the agent to make short-sighted decisions; if it is too high, the value of future rewards is overestimated, which may lead the agent to adopt an overly conservative strategy. Therefore, choosing the right discount factor is very important. The total task processing delay in each step is obtained by solving $\mathcal{P}2'$ based on the current state and action.
Based on the above discussion, the problem of maximizing the long-term discounted reward can be expressed as follows:
$$\max_{\pi} \; \mathbb{E}_{\pi} \left[ \sum_{t} \gamma^t r_t \right],$$
where $\pi$ denotes the task splitting policy.
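To make the MDP concrete, the gym-style wrapper below assembles the state, applies a splitting action, and returns the negative total delay as the reward. It reuses `Scenario` and `task_delay` from the earlier sketches, and `solve_p2_stub` is a hypothetical stand-in for the CVXPY solver of $\mathcal{P}2'$.

```python
class HapLeoEnv:
    """Minimal MDP wrapper for task splitting (illustrative, single agent)."""

    def __init__(self, scenario: Scenario):
        self.scenario = scenario

    def channel_gains(self) -> np.ndarray:
        # Placeholder gains; a faithful model would use Section 3.1's formulas.
        return self.scenario.rng.uniform(1e-12, 1e-10, self.scenario.J)

    def reset(self) -> np.ndarray:
        self.D, self.c, self.T_max = self.scenario.sample_tasks()
        # State: task sizes, computing densities, channel gains, delay limits.
        return np.concatenate([self.D, self.c, self.channel_gains(), self.T_max])

    def step(self, action: np.ndarray):
        alpha = action.reshape(-1, 4)      # per-device splitting proportions
        total_delay = solve_p2_stub(self, alpha)
        reward = -total_delay              # reward = negative total delay
        return self.reset(), reward, False, {}  # new tasks arrive each slot

def solve_p2_stub(env, alpha) -> float:
    # Stub standing in for the CVXPY solver of P2': evenly share server CPUs.
    J = env.scenario.J
    return sum(task_delay(env.D[j], env.c[j], alpha[j], f_loc=1e9,
                          f_hap_direct=env.scenario.F_hap / J,
                          f_hap_fwd=env.scenario.F_hap / J,
                          f_leo=env.scenario.F_leo / J,
                          r_iot_hap=50e6, r_hap_leo=500e6)
               for j in range(J))
```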
5.3.2. Proposed Task Splitting Algorithm Based on DDPG
The algorithm for the optimization of task splitting in this paper is based on the DDPG method, as shown in Figure 2. The algorithm includes four neural networks, namely an actor network, a critic network, a target actor network, and a target critic network. Below, we first introduce the actor and critic networks.
(1) Actor and Critic Networks:
The output of the actor network is a continuous action value, expressed as $a = \mu(s; \theta^{\mu})$, where $\theta^{\mu}$ represents the parameters of the actor network. The output of the critic network is the evaluation of the action $a$ output by the actor network, which can be expressed as $Q(s, a; \theta^{Q})$, where $\theta^{Q}$ represents the parameters of the critic network. During each round of training, the actor network outputs an action decision $a$ based on the current environment state $s$, and then the critic network evaluates the decision; that is, it computes $Q(s, a; \theta^{Q})$.
Our constructed actor network consists of two parts: a deep neural network (DNN) module and an action validity assurance module based on a normalization function, as shown in Figure 3. The DNN module is responsible for making optimized task splitting decisions based on the input state. However, the raw output of the DNN cannot guarantee compliance with constraints (13b) and (13c). Therefore, we design the action validity assurance module to ensure that the actor network's output meets these constraints.
First, we restructure the DNN's output into a two-dimensional tensor (denoted as $\mathbf{O}$) with $J$ rows and four columns. We represent the $j$-th row of $\mathbf{O}$ as $\mathbf{o}_j = (o_{j,0}, o_{j,1}, o_{j,2}, o_{j,3})$, whose elements are non-negative (e.g., produced by a sigmoid activation). Then, we normalize each $\mathbf{o}_j$; the processed result is expressed as $\tilde{\mathbf{o}}_j$, and $\tilde{\mathbf{o}}_j$ satisfies the following:
$$\tilde{o}_{j,i} = \frac{o_{j,i}}{\sum_{i'=0}^{3} o_{j,i'}}, \quad i = 0, 1, 2, 3.$$
Obviously, each $\tilde{\mathbf{o}}_j$ satisfies $\sum_{i=0}^{3} \tilde{o}_{j,i} = 1$ and $0 \le \tilde{o}_{j,i} \le 1$; that is, $\tilde{\mathbf{o}}_j$ satisfies constraints (13b) and (13c). In this way, we can ensure that the output of the actor network (i.e., the task splitting decision) is valid.
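In PyTorch, this action validity assurance module amounts to a few lines. The sigmoid-then-row-normalize choice below is one workable instantiation that we assume for illustration; the exact activation may differ.

```python
import torch
import torch.nn as nn

class ActorWithValidActions(nn.Module):
    """DNN followed by a row-normalization module producing valid splits."""

    def __init__(self, state_dim: int, num_devices: int, hidden: int = 256):
        super().__init__()
        self.J = num_devices
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_devices * 4),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        o = torch.sigmoid(self.net(state)).view(-1, self.J, 4)  # non-negative O
        alpha = o / o.sum(dim=-1, keepdim=True)                 # rows sum to 1
        return alpha.flatten(start_dim=1)                       # action of size 4J
```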
DDPG introduces an experience replay buffer to store state, action, and reward information, represented as tuples $(s_t, a_t, r_t, s_{t+1})$. During training, to update the critic and actor networks, a batch of $K$ samples is first drawn from the replay buffer. For sample $i$, the critic network first computes the target value, which satisfies the following:
$$y_i = r_i + \gamma Q'\left(s_{i+1}, \mu'(s_{i+1}; \theta^{\mu'}); \theta^{Q'}\right).$$
Then, by minimizing the loss function $L(\theta^{Q}) = \frac{1}{K} \sum_{i=1}^{K} \left( y_i - Q(s_i, a_i; \theta^{Q}) \right)^2$, we can update the critic network by one step of gradient descent using $\nabla_{\theta^{Q}} L(\theta^{Q})$. In the same way, we can update the actor network parameters by one step of gradient ascent along the policy gradient $\nabla_{\theta^{\mu}} J \approx \frac{1}{K} \sum_{i=1}^{K} \nabla_{a} Q(s_i, a; \theta^{Q}) \big|_{a = \mu(s_i; \theta^{\mu})} \nabla_{\theta^{\mu}} \mu(s_i; \theta^{\mu})$.
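A condensed PyTorch version of one update round, following the standard DDPG equations above; the critic is assumed to be a module taking (state, action) pairs, and the optimizers are assumed given.

```python
def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma: float = 0.99):
    """One gradient step for the critic (TD loss) and the actor (policy gradient)."""
    s, a, r, s_next = batch  # tensors of shape [K, ...]

    # Critic: regress Q(s, a) toward the bootstrapped target y.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's value of the actor's own actions.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```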
(2) Target Actor and Target Critic Networks:
In order to improve the stability of the training process and reduce its fluctuation and variance, the DDPG method introduces a target actor network and a target critic network. The target actor network outputs the best action of the next state, and the target critic network evaluates the Q value of the next state. The structures of the target actor network and the target critic network are the same as those of the actor network and critic network, and their parameters ($\theta^{\mu'}$ and $\theta^{Q'}$) are slowly updated from the actor and critic networks. The most commonly used parameter update is the soft update, performed as follows [7]:
$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'},$$
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'},$$
where $\tau$ is the soft update factor, which satisfies $0 < \tau \ll 1$; this is beneficial to improving the stability of training.
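The soft update itself is a one-liner per parameter pair; the value $\tau = 0.005$ below is a commonly used assumption rather than a setting from this paper.

```python
@torch.no_grad()
def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p_t, p in zip(target.parameters(), online.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```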
Based on the above analysis, we can summarize the intelligent task splitting and resource allocation algorithm proposed in this paper as Algorithm 1.
Algorithm 1: Task splitting and resource allocation algorithm

1: Randomly initialize the parameters of the actor network and critic network, $\theta^{\mu}$ and $\theta^{Q}$.
2: Initialize the parameters of the target actor network and target critic network: $\theta^{\mu'} \leftarrow \theta^{\mu}$, $\theta^{Q'} \leftarrow \theta^{Q}$.
3: Initialize the replay buffer $\mathcal{B}$.
4: for each training step $t$ do
5:   The actor network generates the action decision $a_t$ based on the current state $s_t$.
6:   Execute action $a_t$, obtain the corresponding reward $r_t$ by solving $\mathcal{P}2'$, and observe the new state $s_{t+1}$.
7:   Store $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $\mathcal{B}$.
8:   Randomly sample a mini-batch of $K$ samples from the replay buffer $\mathcal{B}$.
9:   The critic network calculates the target value $y_i$.
10:  Update the critic network parameters by calculating the loss function $L(\theta^{Q})$ and taking one step of gradient descent using $\nabla_{\theta^{Q}} L(\theta^{Q})$.
11:  Update the actor network parameters by one step of gradient ascent along the policy gradient $\nabla_{\theta^{\mu}} J$.
12:  Update the parameters of the target actor network and target critic network through the soft update (23) and (24).
13: end for
5.4. Complexity Analysis
The complexity of the algorithm proposed in this paper mainly consists of two parts: the reinforcement learning part and the convex optimization part. First, we analyze the complexity of the reinforcement learning part. The complexity of the DDPG-based method designed in this paper mainly depends on the scale of the actor network and critic network. We assume that the actor network and critic network are composed of $X$ and $Y$ fully connected layers, respectively. If the numbers of neurons in the layers are $n_x^{a}$ ($x = 1, \ldots, X$) and $n_y^{c}$ ($y = 1, \ldots, Y$), the training complexity of DDPG is
$$O\left( |s|\, n_1^{a} + \sum_{x=1}^{X-1} n_x^{a} n_{x+1}^{a} + n_X^{a}\, |a| + \left( |s| + |a| \right) n_1^{c} + \sum_{y=1}^{Y-1} n_y^{c} n_{y+1}^{c} \right) \; [38],$$
where $|s|\, n_1^{a}$, $n_X^{a}\, |a|$, and $(|s| + |a|)\, n_1^{c}$ represent the computational complexity of the actor network input layer, the actor network output layer, and the critic network input layer, respectively, and $|a|$ and $|s|$ are the numbers of elements in the action and the state. The complexity of DDPG online decision-making is $O\left( |s|\, n_1^{a} + \sum_{x=1}^{X-1} n_x^{a} n_{x+1}^{a} + n_X^{a}\, |a| \right)$, since online decisions require only a forward pass of the actor network.
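To make the dominant terms concrete, the short computation below counts the multiply-accumulate operations of one actor forward pass for an assumed configuration ($|s| = 60$, $|a| = 24$, two hidden layers of 256 neurons).

```python
def actor_forward_macs(state_dim: int, action_dim: int, hidden: list[int]) -> int:
    """Multiply-accumulates of one actor forward pass (fully connected layers)."""
    widths = [state_dim] + hidden + [action_dim]
    return sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))

# e.g., J = 6 devices: |s| = 60 state elements, |a| = 4*J = 24 action elements.
print(actor_forward_macs(60, 24, [256, 256]))  # 60*256 + 256*256 + 256*24 = 87,040
```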
The computational complexity of the convex optimization part depends on the complexity of solving $\mathcal{P}2'$. The numbers of variables and constraints in $\mathcal{P}2'$ both grow linearly with the number of IoT devices $J$. Therefore, using a standard interior-point method, the computational complexity of solving $\mathcal{P}2'$ is $O\left( V^{3.5} \right)$, where $V$ denotes the number of optimization variables [39].
In summary, the computational complexity of the proposed algorithm in the training phase is the sum of the DDPG training complexity and the complexity of solving $\mathcal{P}2'$ at each training step, and the computational complexity of the online decision-making stage is the sum of the actor network forward-pass complexity and the complexity of solving $\mathcal{P}2'$ once.