1. Introduction
Against the backdrop of the widespread application of Fifth Generation (5G) mobile communication technology, Sixth Generation (6G) mobile communication technology has gradually become a research hotspot for scholars around the world. Research in this field has demonstrated significant commercial potential and attracted extensive attention from industry, including the State Grid Corporation of China [1]. With the rapid development of new power systems and the widespread coverage of the smart grid, the flexible and random access of massive and diverse entities places higher requirements on power communication. Different business requirements arise in a variety of typical power application scenarios, many of which generate large amounts of real-time data that must be uploaded promptly to cloud server data centers for grid regulation and control [2,3].
Especially in blind areas of traditional network coverage or in power emergency communication scenarios, when ground networks are damaged, faulty, or lack signal coverage, ground-based data collection may fail to meet the data transmission requirements of ground power facilities (GPFs). The establishment of a space–air–ground-integrated power three-dimensional converged communication network, one of the technological visions for 6G network design, can effectively supplement existing ground power communication systems and provide reliable support for power data transmission. It promises to become an effective solution for the secure, reliable, and flexible transmission of power business data in the future.
The space–air–ground-integrated network (SAGIN) can be divided into three parts: a space-based network, an air-based network, and a ground-based network [4]. In the space-based network, satellite constellation systems led by Low Earth Orbit (LEO) satellites [5] can achieve seamless, extensive coverage and provide cloud server data processing services for GPFs. In the air-based network, auxiliary communication airborne platforms, such as Unmanned Aerial Vehicles (UAVs), can serve as amplifiers in satellite–ground communication, performing power amplification and signal forwarding. They can also act as micro base stations for the temporary storage and processing of data.
To further enhance the application potential of the SAGIN in power emergency communication scenarios, in this paper we propose a reinforcement learning (RL)-based algorithm for the joint optimization of aerial platform (AP) trajectories, GPF associations, and power control strategies under classification-based throughput constraints. The algorithm aims to ensure the completion of communication tasks while maximizing the long-term overall system energy efficiency. The main contributions of this paper are as follows:
- (1)
We integrate APs with the SAGIN and categorize the communication requirements of the GPFs into real-time and non-real-time data transmission. For real-time data transmission, the priority is to maximize system throughput and keep communication latency low enough for the data from the GPFs to be relayed by the APs and uploaded to the cloud server on the LEO satellite for processing. For non-real-time data transmission, the goal is to maximize overall system energy efficiency by storing as much data as possible on the APs; after the communication task is completed, the APs fly to the corresponding ground management facility to unload the data.
- (2)
To solve the optimization problem, this paper employs an RL approach that models the AP flight paths as a Markov Decision Process (MDP) and proposes a multi-agent iterative optimization algorithm based on a comprehensive assessment of GPF positions and communication workload. The GPF associations and power control are modeled as a multi-agent K-armed bandit problem, and we propose an algorithm based on multi-agent Temporal Difference learning to solve it. The two algorithms iterate alternately, ultimately maximizing the long-term overall system energy efficiency.
- (3)
We verify the performance of the algorithms through simulation. The results indicate that the proposed algorithm outperforms several benchmark algorithms and effectively improves the long-term overall system energy efficiency.
The remainder of this paper is organized as follows. Section 2 reviews related work from recent years. Section 3 introduces the system model and problem formulation. Section 4 presents the solution approach, and Section 5 verifies the algorithm performance through simulation. Finally, Section 6 provides conclusions and future prospects.
2. Related Work
Currently, research on the SAGIN has made significant progress and produced valuable results. In terms of satellite-fused communication, reference [6] proposed using satellites as cloud servers to provide remote computing services for ground users and jointly optimized communication and computing resources to minimize overall system energy consumption. Reference [7] investigated the feasibility of seamless and efficient connections between ground communication systems and satellite networks. In terms of UAV trajectory optimization, reference [8] proposed an algorithm that utilizes the successive convex approximation (SCA) technique to optimize UAV trajectories and minimize overall system energy consumption. Reference [9] introduced an algorithm that discretizes the UAV and user movement process in a multi-UAV scenario to maximize the minimum user transmission rate. Reference [10] presented a trajectory optimization algorithm that minimizes the number of UAVs deployed in a vehicular network by solving the optimization problem with the SCA technique. Reference [11] proposed an algorithm that uses Q-learning to optimize the trajectory of a single UAV to minimize average communication latency.
In addition to the research directions mentioned above, another focus is optimizing device association and power control strategies to enhance system performance and communication quality. Reference [12] proposed an algorithm based on block coordinate descent (BCD) and SCA theory for the joint optimization of device association, UAV trajectories, and power control to maximize the minimum user rate. Reference [13] described an energy consumption minimization algorithm that jointly optimizes user association matching and power allocation using BCD and SCA theory. The aforementioned studies optimized parameters such as energy consumption, throughput, minimum user rate, and average communication latency. However, their analyses were limited to models with a single constraint and lacked consideration of the specific business communication requirements of practical scenarios. For instance, in power emergency communication scenarios, different services may have varying communication throughput requirements while system energy consumption must also be considered. Consequently, a comprehensive optimization of system efficiency that accounts for both throughput and energy consumption becomes an effective solution. Reference [14] proposed an algorithm utilizing the genetic algorithm and simulated annealing to separately optimize device association and power selection to maximize the overall system energy efficiency. Reference [15] presented a method for power allocation and wireless backhaul bandwidth allocation under specific quality of service (QoS) constraints, aiming to maximize the energy efficiency of downlink heterogeneous networks.
Most of the resource allocation optimization algorithms mentioned above employ traditional heuristic algorithms or SCA techniques based on BCD, which require multiple iterations to solve and significantly increase the computational complexity of the system. Additionally, as the network scale grows, the high-speed dynamic characteristics of the SAGIN cause continuous changes in wireless environment parameters, and algorithm adaptability becomes a challenge. RL algorithms are therefore an effective way to solve the resource allocation problem. Reference [16] proposed an algorithm utilizing RL to jointly optimize UAV paths, user device selection, and power allocation, ensuring fairness in user throughput. Reference [17] employed the Q-learning algorithm to jointly optimize UAV paths, user selections, and power allocation to maximize network capacity. Reference [18] proposed a power control scheme based on the Deep Q Network (DQN) to improve the system-level energy efficiency of two-layer 5G heterogeneous multi-channel cells. Reference [19] proposed an innovative algorithm generalization method based on incremental reinforcement learning to enable UAVs to adjust their control strategies in dynamic environments. Reference [20] proposed a new multi-agent recurrent deterministic policy gradient algorithm to control the navigation actions of multiple UAVs. Reference [21] introduced state-of-the-art technologies for UAV-assisted maritime communications, discussing physical layer, resource management, cloud/edge computing, and caching solutions in maritime environments and grouping them according to their performance objectives. Reference [22] presented a deep reinforcement learning method built on a multi-head heterogeneous attention mechanism to learn a policy that automatically and sequentially constructs a route while taking energy consumption into account.
There are also highly effective research results that further integrate the SAGIN system model. Reference [23] proposed leveraging LEO satellites and APs to provide edge computing services for ground users. Reference [24], on the other hand, proposed transmitting user tasks to cloud servers on LEO satellites for processing via UAV relaying. The aforementioned literature focuses on UAV trajectory and user resource allocation algorithms in the SAGIN, but joint optimization research specifically targeting emergency communication scenarios in the power industry remains limited.
3. System Model and Problem Modeling
In this section, we analyze the business requirements of GPFs in the network and construct the space–air–ground-integrated power three-dimensional fusion communication network system model applied to power emergency scenarios. Then, we propose a communication model and an energy consumption model. Finally, the optimization problem is modeled.
3.1. System Model
We consider the uplink transmission in the wireless communication system depicted in Figure 1. The system comprises a single LEO satellite $s$, $N$ GPFs serving as users, and $M$ auxiliary communication airborne platforms serving as APs. In this configuration, the GPFs initiate power data transmission tasks, and the LEO satellite, functioning as a space-based cloud server base station, replaces ground stations to receive signals.
Some GPFs periodically generate a substantial volume of real-time power business data, which is time sensitive and must be transmitted immediately to the cloud server within the 5G base station for processing. However, in power emergency scenarios where public network communication facilities are damaged or malfunctioning, 5G base station signals cannot cover the area, so LEO satellites are employed as substitutes for ground stations to provide the corresponding communication services. Due to the considerable distance between LEO satellites and the Earth's surface, the direct connection speed of a GPF is relatively low, and the communication delay may not meet the needs of power businesses. To address this limitation, APs are utilized to relay and amplify certain signals, thereby enhancing communication transmission rates.
Additionally, some GPFs generate data for non-real-time business operations that require large amounts of data transmission. Since this type of operation is insensitive to time constraints, and considering efficiency and communication resource optimization, the data can be temporarily offloaded onto APs. After completing communication tasks, the APs can fly to the nearest 5G ground station cloud server for data processing. In the system model tailored to the data transfer requirements of these business types, APs serve a dual role: they act as relay nodes for ground-to-satellite communication, facilitating signal forwarding [9], and they function as short-term data storage devices, providing diversified services to meet the data transfer needs of GPFs [8].
GPFs are represented by the set $\mathcal{N}=\{1,2,\dots,N\}$, while APs are denoted by the set $\mathcal{M}=\{1,2,\dots,M\}$. The system employs a quasi-static model, dividing a coverage cycle into $T$ time slots of length $\delta$, with $t\in\mathcal{T}=\{1,2,\dots,T\}$. It is assumed that the positions of the GPFs are known and remain approximately constant throughout the task duration. The location of GPF $n$ is represented as $u_n$. The APs fly at a fixed height of approximately $H$ above the ground, with negligible height differences between them to prevent collisions. Within each time slot, the horizontal position of AP $m$ is considered constant and denoted as $q_m(t)$.
3.2. Communication Model and Energy Consumption Model
In this paper, Frequency Division Multiple Access (FDMA) technology is employed for communication to prevent interference between signals from different devices [15]. Binary variables $a_{n,m}(t)$ are used to represent the association between GPF $n$ and AP $m$ within each time slot, and binary variables $b_{n,s}(t)$ are used to represent the association between GPF $n$ and LEO satellite $s$ within each time slot. $a_{n,m}(t)=1$ denotes that in the current time slot, GPF $n$ chooses to upload data to AP $m$; otherwise, $a_{n,m}(t)=0$. $b_{n,s}(t)=1$ indicates that in the current time slot, GPF $n$ opts for a direct connection to LEO satellite $s$ for data upload; otherwise, $b_{n,s}(t)=0$. Each device can choose only one data transmission mode in the same time slot, either connecting to the LEO satellite or to a specific AP, so each device is subject to the following constraint:

$$\sum_{m\in\mathcal{M}} a_{n,m}(t)+b_{n,s}(t)\le 1,\quad \forall n\in\mathcal{N},\ t\in\mathcal{T}. \tag{1}$$
According to [12], the connections between GPFs and LEO satellites, as well as APs, are characterized by Line-of-Sight (LoS) transmission links. Therefore, the channel gain between GPF $n$ and LEO satellite $s$ is defined as

$$h_{n,s}=\beta_0\,G_n G_s\,d_{n,s}^{-\alpha},\qquad \beta_0=\left(\frac{\lambda}{4\pi}\right)^{2},\ \lambda=\frac{c}{f_c},$$

where $\beta_0$ represents the channel power gain between GPF $n$ and LEO satellite $s$ at a unit distance, $G_n$ denotes the transmit antenna gain of GPF $n$, $G_s$ denotes the receive antenna gain of LEO satellite $s$, $\lambda$ represents the carrier wavelength, $c$ represents the speed of light, $f_c$ represents the carrier frequency, and $\alpha$ represents the distance attenuation factor.
Furthermore, since the GPFs are very far from the LEO satellite, the distance between LEO satellite $s$ and GPF $n$ is simplified to the altitude of the LEO satellite, denoted as $H_s$. The direct communication transmission rate between GPF $n$ and LEO satellite $s$ is given by

$$R_{n,s}(t)=B\log_2\!\left(1+\frac{p_n(t)\,h_{n,s}}{\sigma^2}\right),$$

where $B$ represents the fixed bandwidth allocated for communication between GPFs and LEO satellites and $\sigma^2$ denotes the variance of the Additive White Gaussian Noise (AWGN). $p_n(t)$ represents the transmission power of GPF $n$ during time slot $t$, which is subject to the constraint imposed by the maximum transmission power $P_{\max}$ of the power facility as follows:

$$0\le p_n(t)\le P_{\max},\quad \forall n\in\mathcal{N},\ t\in\mathcal{T}.$$
When the immediate communication workload $D_n^{\mathrm{rt}}$ required by GPF $n$ is substantial, due to the constraint on the transmission power, the direct transmission rate $R_{n,s}(t)$ may not be able to complete the communication task of GPF $n$ within the coverage time. In such cases, the GPFs will consider completing the communication task with the relay forwarding assistance of APs. In time slot $t$, the channel gain between GPF $n$ and AP $m$ is defined as

$$h_{n,m}(t)=\beta_0\,G_n G_m\left(H^2+\left\|q_m(t)-u_n\right\|^2\right)^{-\alpha/2},$$
where $\beta_0$ represents the channel gain at a unit distance and $G_m$ denotes the receive antenna gain of AP $m$. According to Shannon's formula, the transmission rate for communication between GPF $n$ and AP $m$ is defined as

$$R_{n,m}(t)=B\log_2\!\left(1+\frac{p_n(t)\,h_{n,m}(t)}{\sigma^2}\right).$$
When GPF $n$ uploads data through AP $m$, the signal is amplified and relayed by AP $m$. According to [14], the transmission rate between GPF $n$ and LEO satellite $s$ is then defined as

$$R_{n,m,s}(t)=B\log_2\!\left(1+\frac{\gamma_{n,m}(t)\,\gamma_{m,s}}{\gamma_{n,m}(t)+\gamma_{m,s}+1}\right),$$

where $\gamma_{n,m}(t)=p_n(t)h_{n,m}(t)/\sigma^2$ represents the Signal-to-Noise Ratio (SNR) between GPF $n$ and AP $m$, $\gamma_{m,s}=P_m h_{m,s}/\sigma^2$ is the SNR between AP $m$ and LEO satellite $s$, and $P_m$ is the fixed forwarding power of AP $m$. It is evident that when the AP has a sufficiently large forwarding power $P_m$, the transmission rate $R_{n,m,s}(t)$ of GPF $n$ will significantly improve compared to the direct satellite transmission rate $R_{n,s}(t)$.
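To make the link model concrete, the following minimal Python sketch evaluates the LoS channel gains and compares the direct rate $R_{n,s}$ with the amplify-and-forward relayed rate $R_{n,m,s}$ from the expressions above. All parameter values here are illustrative assumptions of our own, not the values of Table 1, and the helper names are not from the paper.

```python
import numpy as np

def channel_gain(distance_m, g_tx=1.0, g_rx=1.0, f_c=2e9, alpha=2.0):
    """LoS channel power gain beta0 * Gt * Gr * d^(-alpha), with beta0
    the gain at unit distance derived from the carrier wavelength."""
    c = 3e8                       # speed of light (m/s)
    lam = c / f_c                 # carrier wavelength (m)
    beta0 = (lam / (4 * np.pi)) ** 2
    return beta0 * g_tx * g_rx * distance_m ** (-alpha)

def direct_rate(p_tx, h_ns, bandwidth, noise_var):
    """Shannon rate of the GPF -> LEO direct link (bit/s)."""
    return bandwidth * np.log2(1 + p_tx * h_ns / noise_var)

def relayed_rate(p_tx, h_nm, p_ap, h_ms, bandwidth, noise_var):
    """End-to-end rate of the GPF -> AP -> LEO amplify-and-forward link."""
    snr_nm = p_tx * h_nm / noise_var        # GPF -> AP SNR
    snr_ms = p_ap * h_ms / noise_var        # AP -> LEO SNR
    snr_e2e = snr_nm * snr_ms / (snr_nm + snr_ms + 1)
    return bandwidth * np.log2(1 + snr_e2e)

# Illustrative numbers only
h_ns = channel_gain(500e3)      # GPF -> LEO over 500 km
h_nm = channel_gain(150.0)      # GPF -> AP over 150 m
h_ms = channel_gain(500e3)      # AP -> LEO
B, sigma2, p_gpf, p_ap = 1e6, 1e-13, 0.5, 10.0
print(direct_rate(p_gpf, h_ns, B, sigma2))     # low direct rate
print(relayed_rate(p_gpf, h_nm, p_ap, h_ms, B, sigma2))  # improved via AP
```

With these numbers, the relayed link is bottlenecked by the AP–satellite hop but still outperforms the weak direct link, which is exactly the effect of a large forwarding power $P_m$ described above.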
In conclusion, the actual data transmission rate of GPF $n$ for immediate communication tasks in time slot $t$ can be expressed as

$$R_n^{\mathrm{rt}}(t)=\sum_{m\in\mathcal{M}} a_{n,m}(t)\,R_{n,m,s}(t)+b_{n,s}(t)\,R_{n,s}(t),$$

where the first term represents the rate at which the GPF communicates through an associated AP and the second term represents the rate at which the GPF communicates directly with the LEO satellite. From (1), the GPF can establish only one connection in a single time slot. The actual data transmission rate of GPF $n$ during time slot $t$ for non-immediate communication tasks is

$$R_n^{\mathrm{nrt}}(t)=\sum_{m\in\mathcal{M}} a_{n,m}(t)\,R_{n,m}(t).$$
The actual communication energy consumption generated by GPF $n$ during time slot $t$ is

$$E_n(t)=\delta\,p_n^{k}(t),$$

where $p_n^{k}(t)$ represents the user transmission power when associated with the $k$-th device. If the actual transmission rate of GPF $n$ at any time $t$ is $R_n(t)$, then, according to [18], the long-term system total energy efficiency is defined as

$$\eta_{EE}=\frac{\displaystyle\sum_{t\in\mathcal{T}}\sum_{n\in\mathcal{N}}\delta\,R_n(t)}{\displaystyle\sum_{t\in\mathcal{T}}\sum_{n\in\mathcal{N}}E_n(t)}.$$
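The following short sketch computes this bits-per-Joule ratio from per-slot rates and powers; the toy arrays and the function name are our own assumptions.

```python
import numpy as np

def long_term_energy_efficiency(rates, powers, slot_len):
    """Long-term total energy efficiency (bit/J): total delivered bits
    over total transmission energy, summed over all GPFs and time slots.
    `rates` and `powers` are (N_gpf, T_slots) arrays in bit/s and W."""
    rates = np.asarray(rates, dtype=float)
    powers = np.asarray(powers, dtype=float)
    total_bits = slot_len * rates.sum()     # numerator: delivered data
    total_energy = slot_len * powers.sum()  # denominator: consumed energy
    return total_bits / total_energy

# Toy example: 2 GPFs over 3 slots of 1 s each
rates = [[1e6, 8e5, 0.0], [5e5, 5e5, 5e5]]
powers = [[0.5, 0.4, 0.0], [0.2, 0.2, 0.2]]
print(long_term_energy_efficiency(rates, powers, slot_len=1.0))
```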
3.3. Problem Modeling
Let the set of association factors between GPFs and LEO satellites be denoted as $\mathcal{B}=\{b_{n,s}(t)\}$, and the set of association factors between GPFs and APs as $\mathcal{A}=\{a_{n,m}(t)\}$. The set of horizontal positions of the APs in each time slot is denoted as $\mathcal{Q}=\{q_m(t)\}$, and the set of transmission powers of the GPFs in each time slot is denoted as $\mathcal{P}=\{p_n(t)\}$. $\mathcal{N}$ represents the GPF set, $\mathcal{M}$ represents the AP set, and $\mathcal{T}$ represents the time slot set.
In this paper, we conduct joint optimization of the device association factors $\mathcal{A}$ and $\mathcal{B}$, the AP flight trajectories $\mathcal{Q}$, and the GPF transmission powers $\mathcal{P}$ to maximize the long-term system total energy efficiency $\eta_{EE}$. The optimization problem is expressed as follows:

$$\mathrm{P1}:\ \max_{\mathcal{A},\,\mathcal{B},\,\mathcal{Q},\,\mathcal{P}}\ \eta_{EE}$$

subject to

C1: $b_{n,s}(t)\in\{0,1\},\ \forall n\in\mathcal{N},\ t\in\mathcal{T}$;
C2: $a_{n,m}(t)\in\{0,1\},\ \forall n\in\mathcal{N},\ m\in\mathcal{M},\ t\in\mathcal{T}$;
C3: $\sum_{m\in\mathcal{M}} a_{n,m}(t)+b_{n,s}(t)\le 1,\ \forall n,\ t$;
C4: $\sum_{n\in\mathcal{N}} a_{n,m}(t)\le K_m,\ \forall m,\ t$;
C5: $\sum_{n\in\mathcal{N}} b_{n,s}(t)\le K_s,\ \forall t$;
C6: $0\le p_n(t)\le P_{\max},\ \forall n,\ t$;
C7: $\sum_{t\in\mathcal{T}}\delta\,R_n^{\mathrm{rt}}(t)\ge D_n^{\mathrm{rt}},\ \forall n$;
C8: $\sum_{t\in\mathcal{T}}\delta\,R_n^{\mathrm{nrt}}(t)\ge D_n^{\mathrm{nrt}},\ \forall n$;
C9: $\|q_m(t+1)-q_m(t)\|\le V_{\max}\,\delta,\ \forall m,\ t$.
C1 and C2 define the binary association factors between GPFs and the LEO satellite and the APs, respectively. C3 indicates that any GPF can choose only a single AP or the LEO satellite for data transmission in any time slot. C4 and C5 denote the upper limits $K_m$ and $K_s$ on the number of GPFs that an AP and the LEO satellite, respectively, can associate with in any time slot. C6 imposes the upper limit $P_{\max}$ on the transmission power of each GPF in any time slot. C7 indicates that each GPF must meet its minimum immediate communication workload requirement $D_n^{\mathrm{rt}}$. C8 signifies that each GPF must satisfy its minimum non-immediate communication workload requirement $D_n^{\mathrm{nrt}}$. C9 denotes that the position variation of each AP within a time slot is constrained by its maximum flight speed $V_{\max}$.
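As a concrete reading of C1–C9, the following sketch checks a candidate solution against all nine constraints at once. The array shapes, argument names, and the vectorized checks are our own assumptions layered on the notation above.

```python
import numpy as np

def check_constraints(a, b, p, q, d_rt, d_nrt, r_rt, r_nrt,
                      k_ap, k_sat, p_max, v_max, delta):
    """Feasibility check of a candidate P1 solution against C1-C9.
    a: (N, M, T) GPF-AP associations, b: (N, T) GPF-satellite associations,
    p: (N, T) transmit powers, q: (M, T, 2) AP horizontal positions,
    r_rt / r_nrt: (N, T) achieved real-time / non-real-time rates."""
    ok = np.isin(a, (0, 1)).all() and np.isin(b, (0, 1)).all()    # C1, C2
    ok &= (a.sum(axis=1) + b <= 1).all()                          # C3
    ok &= (a.sum(axis=0) <= k_ap).all()                           # C4
    ok &= (b.sum(axis=0) <= k_sat).all()                          # C5
    ok &= ((p >= 0) & (p <= p_max)).all()                         # C6
    ok &= (delta * r_rt.sum(axis=1) >= d_rt).all()                # C7
    ok &= (delta * r_nrt.sum(axis=1) >= d_nrt).all()              # C8
    step = np.linalg.norm(np.diff(q, axis=1), axis=2)             # per-slot move
    ok &= (step <= v_max * delta).all()                           # C9
    return bool(ok)
```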
4. Reinforcement Learning-Based Joint Optimization Strategy for Flight Path and Resource Allocation
In this section, we split the problem into two subproblems, modeled as an MDP and a multi-agent time-varying K-armed bandit, respectively. RL algorithms are used to solve the subproblems, which iterate alternately to find the agents' optimal strategies.
4.1. Problem Description and Decomposition
The objective of this paper is to maximize the overall energy efficiency of the system within a given operational cycle $T$ (referred to as an episode). We optimize the position of the APs in each time slot, as well as the device association and power control strategies of the GPFs. The optimization problem P1 can be decomposed into subproblem P2 for the optimization of the AP flight paths and the joint optimization subproblem P3 for GPF device association and power control as follows:

$$\mathrm{P2}:\ \max_{\mathcal{Q}}\ \eta_{EE}\quad \text{s.t. C9};$$

$$\mathrm{P3}:\ \max_{\mathcal{A},\,\mathcal{B},\,\mathcal{P}}\ \eta_{EE}\quad \text{s.t. C1--C8},$$

where C1–C9 are the constraints of P1.
Subproblem P2 assumes that the connection of devices and their transmit power are given in each time slot. The APs need to consider the communication task requirements of GPFs and the total channel gain that they can provide to GPFs to determine their movement. We propose an online path optimization algorithm based on RL to address this problem.
Subproblem P3 assumes that the positions of the APs are given in each time slot. GPFs need to choose an appropriate device association and provide transmit power for data transmission within each time slot. From the constraints of the subproblem, it can be observed that each GPF can only choose one device for association in each time slot, and each device has a maximum connection limit, leading to the issue of connection competition. After determining the associated device, GPFs further optimize their transmit power to achieve maximum throughput. We propose a joint optimization sub-algorithm for device association and power control based on RL to address this problem.
4.2. Aerial Platform Flight Path Optimization Algorithm
In this paper, each AP makes independent decisions in each time slot based on the value of each state provided by environmental feedback. The environment changes based on the decisions made and the rewards obtained in each state, eventually forming a finite MDP within an episode. Therefore, the MDP can be employed to model this subproblem.
In this model, each AP is considered an intelligent agent. The system can be described using a quadruple $\langle M, S, A, R\rangle$, where $M$ represents the number of agents, $S$ represents the states in which each agent can be located, $A$ represents the actions chosen by the agents in those states, and $R$ represents the rewards obtained from the agents' actions in those states.
Based on the four elements mentioned above, the definition of the agent’s state space, action space, and reward function is as follows:
- (1)
State Space: The entire two-dimensional coordinate region within the communication task range of all APs is discretized. The resulting set of two-dimensional coordinates constitutes the state space of the agents. The position of agent $m$ in each time slot represents the agent's state $s_m(t)$.
- (2)
Action Space: We reduce the computational complexity of the algorithm by considering the agents' maximum speed constraint and selecting a set of equidistant fixed displacement values. By applying these fixed displacement values to both the $x$ and $y$ coordinates of the two-dimensional state positions, the combination of target positions attainable in one time slot is generated. The set of all fixed displacement values for the $x$ and $y$ axes constitutes the action space of the agent.
- (3)
Reward Function: The reward function $r_m(t)$ for AP $m$ is defined as the sum of the channel gains of the various GPFs at the target position, weighted by their communication workloads, plus a function of the distances between the AP and the other APs multiplied by a proportionality factor $\phi$. The expression is as follows:

$$r_m(t)=\sum_{n\in\mathcal{N}} w_n(t)\,h_{n,m}(t)+\phi\, f\!\left(d_{m,m'}(t)\right),$$

where $w_n(t)$ serves as the weighting factor for the workload of GPF $n$, $h_{n,m}(t)$ represents the channel gain between GPF $n$ and AP $m$, and $d_{m,m'}(t)$ denotes the distance between AP $m$ and the other APs. The purpose of the second term is to prevent prolonged path overlap between different APs. Based on the proposed model, we present a multi-agent iterative AP path optimization algorithm that comprehensively considers both the positions and the workloads of the GPFs. The specific procedure is illustrated in Algorithm 1.
Initially, the maximum number of training episodes $E_{\max}$ is set, and the initial state $s_m(0)$ is given for each AP (Step 1). In each time slot within an episode loop (Step 2), the algorithm calculates the reward function for each reachable state of each AP (Step 2a). Following a greedy strategy, the APs move and update their states $s_m(t)$ (Step 2b). In the overall algorithm flow, the weights of the reward function are updated based on the remaining communication workload of the GPFs derived from Algorithm 2 (Step 2c). After completing one training episode, the primary algorithm parameters are re-initialized, and the next training episode is executed (Step 3). This process is repeated until all training episodes are completed (Step 4).
Algorithm 1: Aerial Platform Path Optimization Algorithm Based on Comprehensive Evaluation of GPF Position and Workload
(1) Initialization: maximum training episodes $E_{\max}$; for any AP $m\in\mathcal{M}$, given its initial position state $s_m(0)$.
(2) For each iteration in episode $e$, $t=1,\dots,T$, instruct each AP to independently execute the following steps:
(a) Based on the positions of the GPFs, their remaining workloads, and the current state of the AP in this time slot, calculate the reward function $r_m(t)$ for each reachable target position of AP $m$ in the current time slot;
(b) Following a greedy strategy, move each AP towards the target state with the highest reward and update its state $s_m(t)$;
(c) Conduct the iteration of Algorithm 2 for each GPF, update the weight $w_n(t)$ for the remaining workload of each GPF, return to step (a), and proceed to the next iteration until the episode completes.
(3) Set $e=e+1$.
(4) Repeat steps (1) to (3) until $e=E_{\max}$.
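A minimal sketch of one Algorithm-1 step follows: each AP scores its reachable positions with a workload-weighted gain plus a separation term and moves greedily. The displacement step, the proportionality factor value, the choice of the separation function $f$ as a plain sum of distances, and the toy layout are all our own assumptions.

```python
import numpy as np

STEP = 50.0   # assumed fixed displacement per slot (m)
MOVES = np.array([(dx, dy) for dx in (-STEP, 0.0, STEP)
                  for dy in (-STEP, 0.0, STEP)])  # 9 candidate actions

def reward(cand, gpf_pos, weights, other_aps, phi, height, alpha=2.0):
    """Workload-weighted sum of LoS gains (up to a constant beta0) at a
    candidate AP position, plus phi times the separation from other APs."""
    d2 = np.sum((gpf_pos - cand) ** 2, axis=1) + height ** 2
    gains = d2 ** (-alpha / 2.0)
    sep = np.linalg.norm(other_aps - cand, axis=1).sum() if len(other_aps) else 0.0
    return float(weights @ gains) + phi * sep

def greedy_step(ap_pos, m, gpf_pos, weights, phi, height):
    """One Algorithm-1 step for AP m: evaluate every reachable target
    position and move greedily to the one with the highest reward."""
    others = np.delete(ap_pos, m, axis=0)
    cands = ap_pos[m] + MOVES
    scores = [reward(c, gpf_pos, weights, others, phi, height) for c in cands]
    return cands[int(np.argmax(scores))]

# Toy run: 2 APs, 3 GPFs with remaining-workload weights
aps = np.array([[0.0, 0.0], [400.0, 400.0]])
gpfs = np.array([[100.0, 50.0], [120.0, 80.0], [500.0, 450.0]])
w = np.array([2.0, 1.0, 3.0])
for m in range(len(aps)):
    aps[m] = greedy_step(aps, m, gpfs, w, phi=1e-9, height=100.0)
print(aps)   # each AP drifts towards its denser / heavier GPF cluster
```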
4.3. Device Association and Power Control Algorithm
In this paper, when GPFs make action selections, the moving strategy of APs is not considered. Only the current position of APs at the current time slot is taken as the environmental state. Therefore, the decision-making process for device association and power control of GPFs can be modeled as a time-varying K-armed bandit model with multiple intelligent agent states. The action selection of the intelligent agent is considered not to affect changes in the environmental state.
This model regards the GPFs as intelligent agents, and the system can be represented by a quadruple $\langle N, S, A, R\rangle$. The definition of the agent's state space, action space, and reward function is as follows:
- (1)
State Space: To reduce algorithm complexity, the distances between the APs and an agent are discretized into states according to the channel gain they provide. The distances between each AP and the agent together constitute a finite state space.
- (2)
Action Space: Defined as the device association and power control schemes that agents can choose. Agents determine their action choices based on the current state matrix and the policy specified by the algorithm at the current time.
- (3)
Reward Function: The reward function for agent $n$ at time slot $t$ is defined in terms of the instantaneous throughput and instantaneous energy consumption as follows:

$$r_n(t)=\delta\,R_n(t),\qquad e_n(t)=E_n(t),$$

where $R_n(t)$ represents the transmission rate of agent $n$ in time slot $t$ and $E_n(t)$ represents the actual communication energy consumption generated by agent $n$ in time slot $t$.

Let $Q_n^{k}(s,a)$ represent the estimated value of action $a$ chosen by agent $n$ in state $s$ after $k$ selections, which is expressed as follows:

$$Q_n^{k+1}(s,a)=Q_n^{k}(s,a)+\frac{1}{k}\left[r_n(t)-Q_n^{k}(s,a)\right].$$

Let $C_n^{k}(s,a)$ represent the estimated cost of agent $n$ choosing action $a$ in state $s$ after $k$ selections, which is expressed as follows:

$$C_n^{k+1}(s,a)=C_n^{k}(s,a)+\frac{1}{k}\left[e_n(t)-C_n^{k}(s,a)\right].$$

Then, the efficiency estimation function for agent $n$ choosing action $a$ in state $s$ for the $k$-th time can be expressed as

$$\eta_n^{k}(s,a)=\frac{Q_n^{k}(s,a)}{C_n^{k}(s,a)}.$$
The policy for the action selection of agent $n$ is represented as

$$a_n^{*}(t)=\arg\max_{a}\ \eta_n^{k}(s,a).$$

This signifies that the agents always choose the strategy that maximizes the current estimated energy efficiency. However, to balance exploration and exploitation, there is a probability $\varepsilon$ of exploring to discover better actions. The specific policy is expressed as follows:

$$\pi_n(s)=\begin{cases}\arg\max_{a}\ \eta_n^{k}(s,a), & \text{with probability } 1-\varepsilon,\\[2pt] \text{a random action}, & \text{with probability } \varepsilon.\end{cases}$$

To enhance the algorithm's overall performance and expedite the convergence of the iterations, $\varepsilon$ can be set as a linear function of the training episode $e$, as shown below:

$$\varepsilon(e)=\max\left(\varepsilon_0-\kappa\,e,\ 0\right),$$

where $\varepsilon_0$ represents the initial exploration rate and $\kappa$ denotes the rate at which the exploration rate decreases with $e$. When $e=E_{\max}$, the training process concludes, yielding the corresponding training results.
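The mechanics above fit in a small per-GPF agent: incremental sample-average estimates of throughput and energy per state-action pair, and ε-greedy selection on their ratio. This is a minimal sketch; the class name, the initial values $\varepsilon_0=0.9$ and $\kappa=0.01$, and the tiny positive initialization of the cost table (to avoid division by zero) are our own assumptions.

```python
import numpy as np

class BanditAgent:
    """Per-GPF agent: incremental estimates of reward (throughput) and
    cost (energy) per state-action pair, acting epsilon-greedily on
    their ratio, i.e., the estimated energy efficiency."""
    def __init__(self, n_states, n_actions, eps0=0.9, kappa=0.01, rng=None):
        self.q = np.zeros((n_states, n_actions))        # throughput estimate
        self.c = np.full((n_states, n_actions), 1e-9)   # energy estimate
        self.k = np.zeros((n_states, n_actions), dtype=int)
        self.eps0, self.kappa = eps0, kappa
        self.rng = rng or np.random.default_rng()

    def epsilon(self, episode):
        """Linearly decaying exploration rate, floored at zero."""
        return max(self.eps0 - self.kappa * episode, 0.0)

    def act(self, state, episode):
        if self.rng.random() < self.epsilon(episode):
            return int(self.rng.integers(self.q.shape[1]))  # explore
        eta = self.q[state] / self.c[state]                 # efficiency estimate
        return int(np.argmax(eta))                          # exploit

    def update(self, state, action, reward, cost):
        """Sample-average (1/k step size) updates of both estimates."""
        self.k[state, action] += 1
        step = 1.0 / self.k[state, action]
        self.q[state, action] += step * (reward - self.q[state, action])
        self.c[state, action] += step * (cost - self.c[state, action])
```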
Based on the aforementioned model, we propose a device association and power control algorithm based on the multi-agent K-armed bandit. The specific algorithm procedure is illustrated in Algorithm 2.
Firstly, we initialize the maximum number of training episodes $E_{\max}$, the exploration parameter $\varepsilon$, the state-action value function $Q_n(s,a)$, and the energy cost estimation function $C_n(s,a)$, along with the position and initial workload $D_n$ of each GPF (Step 1). In each time slot within an episode loop (Step 2), it is determined whether a GPF has immediate data tasks based on its remaining immediate workload $D_n^{\mathrm{rt}}$. If immediate tasks exist, the real-time policy is executed (Step 2a), meaning that the GPF uploads real-time data to the nearest AP at maximum power; if the candidate AP has reached its maximum number of connections, the GPF connects directly to the LEO satellite. If only non-immediate data tasks remain, the GPF selects action $a_n(t)$ based on policy $\pi_n(s)$ (Step 2b). Then, the instantaneous reward $r_n(t)$ and instantaneous cost $e_n(t)$ are calculated from action $a_n(t)$, the remaining workload is updated, and the remaining workload is passed as a state change to the next iteration of Algorithm 1 (Step 2c). The state-action value function $Q_n(s,a)$ and energy cost estimation function $C_n(s,a)$ are updated using $r_n(t)$ and $e_n(t)$, and the energy efficiency estimation function $\eta_n(s,a)$ is computed for the current time (Step 2d). After completing one training episode, the main algorithm parameters are re-initialized and the next training episode is executed (Step 3). This process is repeated until all training episodes are completed (Step 4).
Algorithm 2: Algorithm for Device Association and Power Control Based on the Multi-Agent K-Armed Bandit
(1) Initialization: maximum training episodes $E_{\max}$, exploration parameter $\varepsilon$, state-action value function $Q_n(s,a)$, and energy cost estimation function $C_n(s,a)$.
(2) For each GPF $n$, given its position $u_n$ and initial workload $D_n$.
(3) For each iteration in episode $e$, $t=1,\dots,T$, prioritize the GPFs based on the type and quantity of their remaining workload. Then, instruct each GPF to independently execute the following steps:
(a) Check whether immediate data tasks still exist; if yes, execute the real-time policy;
(b) If only non-immediate data tasks remain, select action $a_n(t)$ based on policy $\pi_n(s)$;
(c) Based on the executed action $a_n(t)$, obtain the instantaneous reward $r_n(t)$ and instantaneous cost $e_n(t)$, update the remaining workload, and pass the remaining workload as a state change to the next iteration of Algorithm 1;
(d) Update the state-action value function $Q_n(s,a)$ and energy cost estimation function $C_n(s,a)$ based on the instantaneous reward and cost, and update the energy efficiency estimation function $\eta_n(s,a)$ for the current time.
(4) Set $e=e+1$.
(5) Repeat steps (2) to (4) until $e=E_{\max}$.
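One time slot of this procedure can be sketched as below, reusing the hypothetical BanditAgent from the previous sketch; the container layout, the string-tagged actions, and the assumption that `agents` is already sorted by workload priority are our own.

```python
def slot_step(agents, states, has_rt_task, nearest_ap, ap_capacity, episode):
    """One Algorithm-2 time slot (a sketch). GPFs with remaining real-time
    work greedily take their nearest AP at maximum power, falling back to
    a direct satellite link when that AP is full; the remaining GPFs pick
    a (device, power) action epsilon-greedily via their bandit agents."""
    load = [0] * len(ap_capacity)        # connections granted per AP so far
    actions = {}
    for n, agent in enumerate(agents):   # assumed pre-sorted by priority
        if has_rt_task[n]:
            m = nearest_ap[n]
            if load[m] < ap_capacity[m]:
                load[m] += 1
                actions[n] = ("ap", m, "P_max")      # relay at max power
            else:
                actions[n] = ("sat", None, "P_max")  # direct satellite link
        else:
            actions[n] = ("bandit", agent.act(states[n], episode), None)
    return actions
```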
4.4. Overall Algorithm Flow
Combining Algorithms 1 and 2, we propose a joint optimization strategy for flight path and resource allocation based on RL, as shown in Algorithm 3. In each time slot, the algorithm first runs Algorithm 1 based on the remaining throughput of the GPFs to obtain the current coordinates of the APs for that time slot (Step 2b). Subsequently, each GPF observes the state of the current time slot and, based on the trained energy efficiency estimation matrix, executes Algorithm 2 to select the optimal resource allocation strategy (Step 2c). The specific flow is illustrated in Figure 2.
The proposed algorithm draws on the Monte Carlo method and the single-step Temporal Difference method. It updates the value estimate at a single time step and selects actions through a soft greedy strategy to balance exploration and exploitation. It requires only sampled sequences of states, actions, and rewards from the environment, from which the optimal strategy can be iteratively obtained.
Algorithm 3: A Joint Optimization Strategy for Flight Path and Resource Allocation Based on RL
(1) Input: GPF set $\mathcal{N}$, AP set $\mathcal{M}$, the parameters of each algorithm, environmental state information, the GPFs' workloads $D_n$, and the upper limit $P_{\max}$ on the transmission power of the GPFs. Output: optimal flight path strategy $\mathcal{Q}^{*}$ and optimal device association and power control strategy $\{\mathcal{A}^{*},\mathcal{B}^{*},\mathcal{P}^{*}\}$.
(2) For each training episode $e$:
(a) For each time slot $t$:
(b) Based on the GPFs' workloads and positions obtained from Algorithm 2 in the previous time slot, decide the flight path strategy according to Algorithm 1;
(c) Based on the flight path strategy obtained from Algorithm 1 in time slot $t$, decide the device association and power control strategy using Algorithm 2;
(d) Update the environmental state information;
(e) Set $t=t+1$;
(f) End for when $t=T$.
(3) Set $e=e+1$.
(4) End for when $e=E_{\max}$.
(5) Return $\mathcal{Q}^{*}$ and $\{\mathcal{A}^{*},\mathcal{B}^{*},\mathcal{P}^{*}\}$.
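The alternation between the two algorithms can be summarized in the following skeleton. Every method on `env` is a hypothetical placeholder for the simulator (not an interface from the paper), so this is only a structural sketch of how the two sub-algorithms hand state to each other each slot.

```python
def train(env, aps, gpf_agents, n_slots, max_episodes):
    """Algorithm-3 skeleton under an assumed `env` interface: per slot, an
    Algorithm-1 greedy move repositions each AP using the GPFs' remaining
    workloads, then Algorithm 2 selects associations and powers, and the
    bandit estimates are updated from the observed throughput (reward)
    and energy (cost)."""
    for episode in range(max_episodes):
        env.reset()
        for t in range(n_slots):
            for m in range(len(aps)):                        # Algorithm 1 step
                aps[m] = env.greedy_ap_move(m, aps)
            actions = env.select_actions(gpf_agents, episode)  # Algorithm 2 step
            rewards, costs = env.apply(actions, aps)
            for n, agent in enumerate(gpf_agents):             # TD-style updates
                agent.update(env.state(n), actions[n], rewards[n], costs[n])
            env.consume_workload(rewards)   # remaining workload feeds Algorithm 1
    return aps, gpf_agents
```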
5. Simulation Results
The main simulation parameters are listed in Table 1 [14,16,18]. The simulation environment is based on MATLAB and is employed to validate the effectiveness of the proposed algorithm. It is assumed that the GPFs are randomly distributed in a square area. Each GPF is characterized by specific real-time and non-real-time communication throughput requirements. Three APs fly within this area at altitude $H$, aiming to establish LoS communication with the GPFs. Additionally, an LEO satellite serves as a cloud server data center, processing real-time power business data from the GPFs. Specifically, the APs depart from a predefined starting position and, within a specified time $T$, fulfill the communication requirements of all GPFs. The specific simulation parameters are detailed in Table 1.
The number of algorithm iterations $E_{\max}$, the discretization state parameters, and the distance scaling factor $\phi$ in Table 2 and Table 3 are set reasonably, taking into account factors such as the size of the simulated environment and its equipment, computer computing power, and simulation time.
Figure 3 depicts the flight paths of three APs in the system when accommodating 10 GPFs. It can be observed that the APs tend to move towards regions with denser GPF positions. Additionally, the three APs predominantly serve distinct groups of GPFs. As the remaining throughput requirements of GPFs change, the APs continuously adjust their positions. While meeting the demands of the majority of GPFs, efforts are made to consider GPFs with a higher remaining throughput. This indicates that through optimized flight paths, APs can enhance service provision for GPFs, consequently improving the overall system energy efficiency.
Figure 4 compares the variation of the overall system energy efficiency with the number of training episodes under different exploration rates. To represent this variation more intuitively, Figure 5 averages the obtained energy efficiency values every 10 episodes to produce the corresponding fluctuation curves. It can be observed that when the exploration rate is relatively high, the algorithm converges to a lower maximum energy efficiency value, but the exploration of possible action strategies is more thorough. As the exploration rate decreases, the convergence speed of the algorithm decreases, but it can converge to higher energy efficiency values. The proposed algorithm therefore adopts an exploration rate that decreases with the number of training episodes: it initially sacrifices energy efficiency to speed up convergence and, after a sufficiently thorough exploration of the policy space, converges to the globally optimal energy efficiency value, thereby improving the algorithm's performance.
To illustrate the effectiveness of the proposed algorithm in this paper, the algorithm is compared with the following three algorithms:
- (1)
Random Path Algorithm (RP algorithm): Within the state space defined in Algorithm 1, each AP is set to move towards a random GPF in each time slot. To facilitate comparison, the device association and power control strategy of Algorithm 2 remains unchanged.
- (2)
Random Device Selection Algorithm (RS algorithm): All GPFs are randomly associated with available APs. The path selection for APs and power control strategy for GPFs remain consistent with the proposed algorithm.
- (3)
Equal Transmit Power Algorithm (EP algorithm): All GPFs set their transmission power to the same fixed value, ensuring that each GPF meets its throughput requirement. The path selection for APs and device association strategy for GPFs remain consistent with the proposed algorithm.
Figure 6 depicts the curve of total system energy efficiency against noise power. The energy efficiency of all four algorithms decreases as the noise power increases. Higher noise power significantly reduces the transmission rates of the GPFs, and when the noise power is excessive, the system energy efficiency drops substantially because the GPFs must raise their transmit power towards the maximum to meet throughput requirements. Under all noise power conditions, the proposed algorithm outperforms the other three algorithms: over the noise power range, its energy efficiency is improved by 17.23%, 162.87%, and 4.45% compared with the RP, EP, and RS algorithms, respectively.
Figure 7 illustrates the curve of total system energy efficiency with respect to the bandwidth allocated to the GPFs. The total system energy efficiency increases with the channel bandwidth allocated to the GPFs, since higher channel bandwidth provides the GPFs with higher transmission rates, leading to an overall improvement in system energy efficiency. Under all GPF channel bandwidth allocations, the proposed algorithm outperforms the other three algorithms: over the bandwidth range, its energy efficiency is improved by 16.63%, 116.42%, and 4.96% compared with the RP, EP, and RS algorithms, respectively.
Figure 8 depicts the variation of total system energy efficiency with the throughput load rate of the APs. Constrained by their flying altitude, the APs can provide only a limited maximum throughput to the GPFs in each time slot. The throughput load rate is defined as the ratio of the total GPF throughput demand to the maximum throughput that the APs can provide. When the load rate is below 100%, the system's total energy efficiency remains relatively stable; when it exceeds 100%, the energy efficiency shows a decreasing trend and may decline sharply under excessively high load rates. This is because the APs cannot meet the throughput demands of the GPFs, forcing the GPFs to raise their transmission power and thereby reducing the overall system energy efficiency. This can be mitigated by adjusting the altitude of the APs or allocating a larger bandwidth to each GPF. Under any load rate condition, the proposed algorithm outperforms the other three algorithms: over the load rate range, its energy efficiency is improved by 7.92%, 34.71%, and 6.79% compared with the RP, EP, and RS algorithms, respectively.
Figure 9 illustrates the variation of total system energy efficiency with the number of GPFs, which increases incrementally from six to thirteen. When the number of GPFs is too low or too high, the performance of each algorithm decreases. This may be attributed to a mismatch between the scaling factor $\phi$ in Algorithm 1 and the number of GPFs; validation confirms that adjusting $\phi$ can improve algorithm performance. The algorithm achieves its maximum performance when the number of GPFs equals the sum of the maximum connectable devices over all APs. As the number of GPFs increases further, the total system energy efficiency shows a decreasing trend, owing to intensified competition for connections in each time slot, which leaves some GPFs unable to connect to an AP. This can be improved by increasing the number of APs or the number of connectable devices per AP. Under most device-number scenarios, the proposed algorithm outperforms the other three algorithms: over the device-number range, its energy efficiency is improved by 7.13%, 31.15%, and 4.23% compared with the RP, EP, and RS algorithms, respectively.
In the table below, we list quantitative indicators of the proposed algorithm's energy efficiency improvement. The proposed algorithm achieves a significant improvement over the EP algorithm, but the gains over the other two algorithms are less pronounced. This may be because the number of APs in the simulation scenario is small and the users are relatively close to each other, so the effects of device selection and path optimization are less visible. In future work, we will study larger numbers and wider distributions of users, for which more favorable results are expected.
In addition to the analysis of simulation performance, we also analyze the limitations and computational complexity of the algorithm. In terms of limitations, compared with the RP algorithm, the proposed algorithm considers AP path optimization and performs better in scenarios where the APs must move over a large range. Compared with the EP algorithm, the proposed algorithm optimizes the transmit power of the GPFs, adapts to the different communication needs of each GPF, and greatly improves the energy efficiency of the system. Compared with the RS algorithm, the proposed algorithm optimizes the device selection of the GPFs and performs better in scenarios where a large number of GPFs access the network. The proposed algorithm thus addresses the challenges of large-scale GPF distribution, multi-service communication, and large-scale access, and therefore has strong applicability. However, for larger state and action spaces, the proposed algorithm requires a certain discretization of both spaces, entails long training times, and is sensitive to parameter selection. These are its main limitations.
In this article, Algorithm 1 is used to optimize the flight paths of the APs, and its computational complexity is $O(E_{\max}\,T\,M)$, where $E_{\max}$ represents the number of iterations, $T$ represents the number of task slots, and $M$ represents the number of APs. Algorithm 2 is used to optimize the device selection and power control of the GPFs, and its computational complexity is $O(E_{\max}\,T\,N)$, where $N$ represents the number of GPFs. The computational complexity of the overall algorithm flow is therefore $O\big(E_{\max}\,T\,(M+N)\big)$. As shown in Table 4, except for the slightly lower complexity of the RP algorithm, the complexity of the other two benchmark algorithms is consistent with that of the proposed algorithm.
6. Conclusions
In this paper, we study the energy efficiency optimization problem of the space–air–ground-integrated power three-dimensional converged communication network in emergency communication scenarios. On the basis of the system model, we formulate the joint optimization problem of AP flight path selection, GPF association, and power control. The AP flight path subproblem is modeled as an MDP, and an AP path optimization algorithm based on a comprehensive evaluation of GPF locations and workloads is proposed. The GPF association and power control subproblem is modeled as a multi-agent, time-varying K-armed bandit model, for which we propose a device association and power control algorithm based on Temporal Difference learning. Combining these two algorithms, we propose a joint optimization strategy for flight path and resource allocation based on RL. Through the collaborative iteration of the two algorithms, the optimization of AP flight paths, GPF association, and power control is achieved, improving the overall energy efficiency of the system while meeting the specified throughput requirements. Finally, through computer simulations and comparisons with three benchmark algorithms, the proposed algorithm's effectiveness in enhancing the overall system energy efficiency is validated. In practical scenarios, adjusting the scaling factor of the AP reward can adapt the algorithm to different numbers of GPFs and achieve optimal system performance under the current throughput requirements.
In future work, we will improve the settings of some algorithm parameters, incorporate deep learning to further optimize the algorithm's iterative process, train neural networks as a framework for strategy selection, and conduct simulations in larger-scale environments, comparing against other classic reinforcement learning algorithms to analyze the results in more depth.