1. Introduction
The development of the maritime Internet of Things (IoT) facilitates increasingly frequent maritime activities and greatly promotes the development of the maritime economy. The surging demand for wireless communication puts great pressure on maritime wireless communication systems for the next generation of wireless communication technologies (such as beyond Fifth-Generation (5G) and Sixth-Generation (6G)) [1,2]. However, maritime wireless communication technology lags behind its terrestrial counterpart [3]. For example, the maritime environment lacks the conditions for constructing communication facilities (such as base stations), and maritime spectrum resources are scarce [4]. Besides, complex sea-surface changes (such as wave motion and evaporation ducts) make the maritime wireless communication environment unstable [5]. Hence, it is of great significance to research the maritime IoT to realize flexible, efficient, and reliable transmission.
In the maritime IoT system, buoys embedded with a variety of sensors are usually deployed on the sea surface to monitor the maritime environment. However, these buoys are generally outside the service coverage of the offshore base stations (OBSs). Although monitoring data can be transmitted through the multi-hop network formed by buoys and ships, the connectivity of the network is seriously damaged when nodes in the network fail [6]. Satellite communication could mitigate the issue of network failure, but it has fatal drawbacks in terms of high delay, low link reliability, and high cost [7]. The unmanned-aerial-vehicle (UAV)-assisted maritime communication system is regarded as a promising solution to the high delay and low reliability. A UAV acting as an aerial mobile base station has the following advantages in assisting maritime wireless communication. On the one hand, the high maneuverability of the UAV makes it easier to establish line-of-sight (LoS) communication links between UAVs and buoys and also allows UAVs to flexibly approach buoys to enhance the communication links [8]. Hence, the UAV can better adapt to the complex maritime wireless communication environment and is very suitable for emergency communication scenarios. On the other hand, the UAV usually has the advantages of small size, low cost, and easy control [9]. The application of UAVs can reduce the risk of manual operation in dangerous maritime environments. These advantages also give UAVs high concealment, so they are widely used in maritime military activities.
However, the size of the batteries embedded in UAVs and buoys is limited. If UAVs frequently return to the charging area, the efficiency of mission execution will suffer. Moreover, frequent battery replacement increases the maintenance cost of buoys and is, in fact, difficult to achieve in practice due to the harsh sea environment. Hence, the performance of UAV-assisted maritime wireless communication is severely limited by the energy consumption of UAVs and buoys [10]. This energy disadvantage can be compensated by multi-UAV cooperative data transmission. A multi-UAV communication system has a wider communication range and shorter transmission distances, which reduces the communication delay and improves the energy efficiency [11]. In the multi-UAV communication system, UAVs need to collect the data cached in buoys and offload the data to the OBS. In this process, a UAV may travel far from the OBS, so offloading the data may take a long time. Hence, it is important to design the trajectories of multiple UAVs to ensure the data transmission requirement under the condition of limited energy and to shorten the mission completion time of data collection and data offloading.
Recently, extensive research has focused on multi-UAV-assisted wireless communication scenarios. Diao et al. [12] solved the multi-UAV deployment problem by a circular deployment scheme and jointly optimized the offloading and scheduling strategies to minimize the weighted sum of transmission and hover energy consumption. Kuo et al. [13] studied multi-UAV-assisted data collection scenarios with circular trajectories and considered the deployment of UAVs and device association to minimize the total energy consumption of IoT devices. Gao et al. [14] jointly optimized UAV trajectories, ground equipment scheduling, and power allocation by using block coordinate descent and successive convex approximation to maximize the overall throughput of ground devices. However, the above works only considered uplink or downlink transmission scenarios. Hua et al. [15] studied simultaneous uplink and downlink transmission systems of multiple UAVs, in which one UAV is responsible for downlink transmission and the other for uplink transmission. Furthermore, Liu et al. [16] allowed each UAV to dynamically change the transmission mode in each time slot, but this work assumes the UAVs' trajectories are known in advance. It is significant that UAVs can change the mission mode according to the transmission requirements. For example, a UAV adopts the offloading mode after completing the collection mission. If the UAV fails during the collection mission, none of the data it collected can be returned to the OBS, resulting in data loss. The high coupling of UAV trajectories, association relationships, and mission mode is challenging and can usually be described as a mixed-integer non-convex problem. The above works effectively solved this kind of problem through traditional convex optimization techniques and iterative methods, but these suffer from high computational complexity and the curse of dimensionality as the action and state dimensions of the system grow due to the high dynamics of UAVs.
With the development of machine learning technology, deep reinforcement learning (DRL) algorithms have received extensive attention for their advantage in dealing with highly dynamic environments [17,18,19]. In reinforcement learning, the agent maximizes the cumulative reward by constantly exploring the environment. DRL combines the advantages of the deep neural network (DNN) and can handle more complex actions and states. At present, much research has applied DRL algorithms to multi-UAV trajectory optimization. References [20,21,22] studied multi-UAV trajectory optimization schemes based on the multi-agent deep deterministic policy gradient (MADDPG) algorithm and regarded the UAVs as the agents. Wu et al. [20] aimed to minimize the average age-of-information (AoI) and proposed an MADDPG algorithm to jointly optimize the UAV sensing location and transmission location. However, the MADDPG algorithm can only handle a continuous action space. Hence, Wang et al. [21] combined a low-complexity mission decision-making method with the MADDPG algorithm. Gao et al. [22] used a potential game approach to determine the service allocation between users and multiple UAVs, which effectively solved the mixed-integer non-linear problem. When DRL algorithms are used to process a hybrid discrete and continuous action space, discretizing a continuous action undoubtedly reduces the flight accuracy of UAVs and is inconsistent with actual UAV flight actions, while relaxing the discrete action space into a continuous one increases the complexity of the action space, as in Hausknecht et al. [23]. Xiong et al. [24] proposed the parameterized deep Q-network (PDQN) algorithm based on the parameterized action space. Yin et al. [25] proposed a multi-agent parameterized deep Q-network (MAPDQN) algorithm to maximize both the overall throughput and the fairness throughput. However, the action decision-making method of the PDQN algorithm may reduce the maximum Q-value [26]. Then, Fan et al. [26] proposed a hybrid proximal policy optimization (HPPO) algorithm, which can efficiently handle the hybrid action space based on the proximal policy optimization (PPO) algorithm. We summarize the differences between our work and the existing related references in Table 1.
Based on the above discussion, this paper studies the multi-UAV-assisted maritime IoT systems, in which the missions of the UAVs consist of collecting data from buoys and offloading the data to the OBS. The UAVs could adaptively change the mission mode according to the network resources and mission requirements. Our goal was to minimize the total mission completion time by jointly optimizing the UAV trajectories, mission mode selection, the transmit power of the buoys, and the buoy–UAV and UAV–OBS association relationships with the constraints of energy consumption and a no-fly zone. The main contributions of our paper are summarized as follows:
We propose an adaptive data transmission mission mode framework for the multi-UAV-assisted maritime IoT system. Specifically, UAVs are allowed to change the mission mode in each time slot according to the channel quality and data volume, which improves the flexibility of offshore data collection and offloading.
A multi-UAV trajectory optimization algorithm based on multi-agent HPPO (MAHPPO) is proposed to overcome the problem of a hybrid discrete and continuous action space. The total mission completion time is effectively shortened through the flexible scheduling of the UAV trajectories, mission mode, and transmit power of buoys.
In order to reduce the exploration difficulty and action space dimension of the MAHPPO algorithm, we propose an algorithm based on the stable marriage problem (SMP) to determine the buoy–UAV and UAV–OBS association relationships. The simulation results show that the proposed algorithm can effectively reduce the total mission completion time with low computational complexity.
The remainder of the paper is organized as follows. The multi-UAV data collection and offloading model and the problem formulation are given in Section 2. Then, the problem analysis and the proposed algorithm are given in Section 3. The simulation results are given in Section 4. Finally, we conclude the paper in Section 5.
2. System Model and Problem Formulation
As shown in Figure 1, we considered a UAV-assisted maritime IoT system where U UAVs with the same fixed flight height H were used as aerial base stations to execute the data collection–offloading mission in the target area. The mission of the UAVs is to collect the hydrometeorological data sensed by M buoys randomly distributed in the target area and offload all the collected data to the OBS. The sets of UAVs and buoys are denoted as $\mathcal{U}=\{1,\dots,U\}$ and $\mathcal{M}=\{1,\dots,M\}$, respectively. The total mission completion time of the UAVs is divided into multiple equal time slots and is denoted as $T=K\delta$, where K is the number of time slots, $\mathcal{K}=\{1,\dots,K\}$ is the set of time slots, and $\delta$ is the duration of each time slot. Note that, at each time slot, the UAVs need to choose the appropriate operation mode between data collection and data offloading according to the current network states. In each time slot, UAV u cannot collect and offload data at the same time.
The horizontal coordinate of UAV u at time slot k is denoted as $\mathbf{q}_u[k]=(x_u[k],y_u[k])$. The flight of UAV u is described by an angle $\theta_u[k]\in[0,2\pi)$ and a velocity $v_u[k]\in[0,v_{\max}]$, where $v_{\max}$ is the maximum flight velocity of the UAV. The coordinates of UAV u can be expressed as $x_u[k+1]=x_u[k]+v_u[k]\delta\cos\theta_u[k]$ and $y_u[k+1]=y_u[k]+v_u[k]\delta\sin\theta_u[k]$, respectively. Considering the limitation of the flight range, $x_u[k]$ and $y_u[k]$ are respectively constrained by $0\le x_u[k]\le X_{\max}$ and $0\le y_u[k]\le Y_{\max}$, where $X_{\max}$ and $Y_{\max}$ are the length and width of the target area, respectively. Moreover, the distance between UAV u and UAV $u'$ at time slot k can be expressed as $d_{u,u'}[k]=\lVert\mathbf{q}_u[k]-\mathbf{q}_{u'}[k]\rVert$. In order to avoid a collision, it is necessary to ensure the minimum safety distance $d_{\min}$ between UAVs, which is expressed as $d_{u,u'}[k]\ge d_{\min},\ \forall u\ne u'$.
Denote $\mathcal{M}_0=\mathcal{M}\cup\{0\}$ as the set of the OBS and buoys, where index 0 means the OBS. The horizontal coordinate of buoy m (or the OBS) is $\mathbf{w}_m$. Hence, the distance of the buoy–UAV and UAV–OBS links at time slot k can be given by $d_{m,u}[k]=\sqrt{\lVert\mathbf{q}_u[k]-\mathbf{w}_m\rVert^{2}+H^{2}}$.
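For illustration, the following Python sketch implements the mobility and geometry relations above under our reconstructed notation; the helper names and the default slot duration are ours, not part of the original model.

```python
import numpy as np

def step_uav(q, v, theta, delta=1.0):
    """Advance one UAV's horizontal position by one time slot.

    q: (x, y) position in meters; v: speed in m/s; theta: heading in radians.
    """
    return q + v * delta * np.array([np.cos(theta), np.sin(theta)])

def pairwise_safe(positions, d_min):
    """Check the minimum inter-UAV safety distance for all UAV pairs."""
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) < d_min:
                return False
    return True

def link_distance(q_u, w_m, H):
    """3D distance from a UAV at height H to a buoy/OBS at sea level."""
    return np.sqrt(np.sum((q_u - w_m) ** 2) + H ** 2)
```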
2.1. Channel Model
Due to the special maritime propagation environment, we adopted the air-to-ground channel model and the two-ray path loss model [27]. More specifically, the probability of the LoS link is expressed as $P^{\mathrm{LoS}}_{m,u}[k]=\frac{1}{1+a\exp\left(-b\left(\phi_{m,u}[k]-a\right)\right)}$, where a and b are two constant values depending on the environment and $\phi_{m,u}[k]$ denotes the elevation angle of the buoy–UAV and UAV–OBS links, which is given by $\phi_{m,u}[k]=\frac{180}{\pi}\arcsin\left(\frac{H}{d_{m,u}[k]}\right)$.
The average path loss of the buoy–UAV and UAV–OBS links is computed according to the probability of establishing the LoS link, which can be expressed as
$$\overline{PL}_{m,u}[k]=P^{\mathrm{LoS}}_{m,u}[k]\,PL^{\mathrm{LoS}}_{m,u}[k]+\left(1-P^{\mathrm{LoS}}_{m,u}[k]\right)PL^{\mathrm{NLoS}}_{m,u}[k],$$
where $PL^{\mathrm{LoS}}_{m,u}[k]$ and $PL^{\mathrm{NLoS}}_{m,u}[k]$ are the average path loss of the LoS and NLoS links, respectively, which can be given by
$$PL^{\mathrm{LoS}}_{m,u}[k]=\eta_{\mathrm{LoS}}\left(\frac{4\pi d_{m,u}[k]}{\lambda}\right)^{\alpha_{\mathrm{LoS}}},\qquad PL^{\mathrm{NLoS}}_{m,u}[k]=\eta_{\mathrm{NLoS}}\left(\frac{4\pi d_{m,u}[k]}{\lambda}\right)^{\alpha_{\mathrm{NLoS}}},$$
where $\lambda$ is the wavelength, $\eta_{\mathrm{LoS}}$ and $\eta_{\mathrm{NLoS}}$ are the excessive path losses for the LoS and NLoS paths, respectively, and $\alpha_{\mathrm{LoS}}$ and $\alpha_{\mathrm{NLoS}}$ are the LoS and NLoS link factors, respectively.
According to the above discussion, the channel gain of the buoy–UAV and UAV–OBS links can be expressed as $g_{m,u}[k]=10^{-\overline{PL}_{m,u}[k]/10}$.
The signal-to-noise-ratio (SNR) of the buoy–UAV and UAV–OBS links at time slot k is given by
$$\gamma_{m,u}[k]=\frac{p_m[k]\,g_{m,u}[k]}{\sigma^{2}},\qquad \gamma_{u,0}[k]=\frac{p_u\,g_{u,0}[k]}{\sigma^{2}},$$
where $\sigma^{2}$ is the Gaussian noise power and $p_m[k]$ and $p_u$ are the transmit power of buoy m and UAV u at time slot k, respectively. Moreover, $p_u$ is a constant.
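The following Python sketch illustrates one way to evaluate this probabilistic LoS channel; the numeric constants (a, b, the excess losses, the merged path-loss exponent, and the noise power) are placeholder values for illustration, not the parameters used in the paper's simulations.

```python
import numpy as np

def p_los(elevation_deg, a=9.61, b=0.16):
    """Probability of a LoS link in the common air-to-ground model;
    a and b are environment-dependent constants (values illustrative)."""
    return 1.0 / (1.0 + a * np.exp(-b * (elevation_deg - a)))

def avg_path_loss_db(d, H, wavelength, eta_los=1.0, eta_nlos=20.0, alpha=2.0):
    """LoS-probability-weighted path loss in dB (free-space core with
    excess losses eta_los/eta_nlos in dB; exponents merged into alpha)."""
    phi = np.degrees(np.arcsin(H / d))        # elevation angle in degrees
    fspl_db = 10 * alpha * np.log10(4 * np.pi * d / wavelength)
    p = p_los(phi)
    return p * (fspl_db + eta_los) + (1 - p) * (fspl_db + eta_nlos)

def snr(p_tx_w, pl_db, noise_w=1e-13):
    """Linear SNR from transmit power (W), path loss (dB), noise power (W)."""
    gain = 10 ** (-pl_db / 10)
    return p_tx_w * gain / noise_w
```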
2.2. Transmission Model
Let $\alpha_{m,u}[k]\in\{0,1\}$ denote the association indicator of the buoy–UAV and UAV–OBS links. $\alpha_{m,u}[k]=1$ if buoy m is associated with UAV u at time slot k; otherwise, $\alpha_{m,u}[k]=0$. Similarly, $\alpha_{0,u}[k]=1$ if UAV u is associated with the OBS at time slot k; otherwise, $\alpha_{0,u}[k]=0$. At time slot k, each UAV can only collect data from one buoy, and each buoy can be associated with at most one UAV.
The total system spectrum bandwidth is B and is allocated equally to each buoy–UAV link and UAV–OBS link. Each buoy–UAV link and UAV–OBS link uses mutually independent and non-overlapping spectrum resources. Then, the spectrum bandwidth of each buoy–UAV link and UAV–OBS link at time slot k can be given by $B[k]=B/N[k]$, where $N[k]$ is the number of UAVs successfully associated with the buoys and the OBS at time slot k.
The transmission rate of the buoy–UAV and UAV–OBS links at time slot k is expressed as $R_{m,u}[k]=B[k]\log_2\left(1+\gamma_{m,u}[k]\right)$.
In order to ensure that the data collection requirements of the buoys are satisfied, the following condition is considered:
$$\sum_{k=1}^{K}\sum_{u\in\mathcal{U}}\alpha_{m,u}[k]\,R_{m,u}[k]\,\delta\ \ge\ D_m,\quad\forall m\in\mathcal{M},$$
where $D_m$ denotes the data volume that needs to be collected from buoy m. It is necessary to guarantee that all data collected by the UAVs are offloaded to the OBS. Moreover, a UAV that has collected enough data is allowed to offload the data at time slot k, even though the data in the buoys have not been completely collected. The total data volume collected by UAV u before time slot k is expressed as $C_u[k]=\sum_{j=1}^{k-1}\sum_{m\in\mathcal{M}}\alpha_{m,u}[j]R_{m,u}[j]\delta$. The total data volume offloaded from UAV u to the OBS before time slot k is expressed as $O_u[k]=\sum_{j=1}^{k-1}\alpha_{0,u}[j]R_{u,0}[j]\delta$, where $k_c$ denotes the time slot in which the data in all buoys are completely collected. Hence, each UAV should satisfy the constraint $O_u[k]\le C_u[k]$ for all k, and all collected data must be offloaded by the final time slot. In particular, after time slot $k_c$, UAV u can only offload data to the OBS rather than collect data, even if it has not collected enough data.
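A minimal sketch of the per-slot rate and data bookkeeping implied by these constraints is given below; the function names and the equal-split bandwidth rule follow the description above, while the buffer update enforcing "offload no more than collected" is our reading of C14–C16.

```python
import numpy as np

def slot_rates(snrs, active_links, B):
    """Equal bandwidth split across all active buoy-UAV / UAV-OBS links."""
    n = max(active_links, 1)
    return (B / n) * np.log2(1.0 + np.asarray(snrs))

def update_buffers(collected, offloaded, rate, delta, mode):
    """Per-slot bookkeeping: a UAV may offload only data it has collected."""
    if mode == "collect":
        collected += rate * delta
    elif mode == "offload":
        offloaded += min(rate * delta, collected - offloaded)
    return collected, offloaded
```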
2.3. Energy Consumption Model
The energy consumption of the UAV mainly includes propulsion energy and communication energy [28]. The propulsion energy keeps the UAV hovering and supports its flight movement. The communication energy consumption is mainly caused by data transmission, signal processing, the communication circuit, etc. The order of magnitude of the propulsion energy is much larger than that of the communication energy. Hence, in this paper, we mainly considered the propulsion energy consumption and set the transmit power of the UAVs, $p_u$, to be a constant. Then, the communication energy caused by the data transmission from the UAVs to the OBS is $E^{\mathrm{com}}_u=\sum_{k=1}^{K}\alpha_{0,u}[k]\,p_u\,\delta$.
The propulsion power of UAV
u mainly depends on the flight velocity, given by
where
denotes the tip speed of the rotor blade,
is the mean rotor-induced velocity,
is the fuselage drag ratio,
and
are the blade profile power and induced power, respectively, and
,
l, and
f denote the air density, rotor solidity, and blade angular velocity, respectively. The propulsion energy consumption of UAV
u can be expressed as
Then, the total energy consumption of UAV u is given by $E_u=E^{\mathrm{pro}}_u+E^{\mathrm{com}}_u$, which must satisfy $E_u\le E^{\max}_u$, where $E^{\max}_u$ is the energy threshold of UAV u.
The energy consumption of the buoys is mainly generated by data transmission, expressed as $E_m=\sum_{k=1}^{K}\sum_{u\in\mathcal{U}}\alpha_{m,u}[k]\,p_m[k]\,\delta$, which must satisfy $E_m\le E^{\max}_m$, where $E^{\max}_m$ is the energy threshold of buoy m.
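Since the paper's propulsion-power equation is not reproduced above, the sketch below uses the standard rotary-wing model from the UAV literature (blade-profile, induced, and parasite terms), which the variable list above suggests; the numeric constants are typical literature values, not the paper's.

```python
import numpy as np

def propulsion_power(v, P0=79.86, Pi=88.63, U_tip=120.0, v0=4.03,
                     d0=0.6, rho=1.225, s=0.05, A=0.503):
    """Standard rotary-wing propulsion power (W) at speed v (m/s):
    blade-profile + induced + parasite terms; constants illustrative."""
    blade = P0 * (1 + 3 * v**2 / U_tip**2)
    induced = Pi * np.sqrt(np.sqrt(1 + v**4 / (4 * v0**4))
                           - v**2 / (2 * v0**2))
    parasite = 0.5 * d0 * rho * s * A * v**3
    return blade + induced + parasite

def propulsion_energy(speeds, delta=1.0):
    """Sum the per-slot propulsion energy over a trajectory."""
    return sum(propulsion_power(v) * delta for v in speeds)
```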
2.4. Problem Formulation
Let $\mathbf{A}=\{\alpha_{m,u}[k],\forall m,u,k\}$ denote the association variables, $\mathbf{P}=\{p_m[k],\forall m,k\}$ denote the transmit power of the buoys, $\mathbf{V}=\{v_u[k],\forall u,k\}$ denote the flight velocity of the UAVs, and $\mathbf{\Theta}=\{\theta_u[k],\forall u,k\}$ denote the flight angle of the UAVs. Our goal is to minimize the completion time of the data collection–offloading missions for all UAVs by the joint optimization of $\mathbf{A}$, $\mathbf{P}$, $\mathbf{V}$, and $\mathbf{\Theta}$, yielding the total mission completion time minimization problem P1. In problem P1, C1 and C2 are the UAV maximum velocity and maximum angle constraints, respectively. C3 and C4 are the SNR constraints, where the thresholds of the SNR of the buoy–UAV and UAV–OBS links are specified, respectively. C5 restricts the maximum transmit power of the buoys. C6 and C7 mean that the energy consumption of the UAVs and buoys should be smaller than their maximum energy. C8–C10 ensure that the UAVs keep apart from each other to avoid collision and do not fly beyond the target area. C11–C13 are the association constraints. C14–C16 are the data transmission constraints of the UAV data collection and offloading process.
3. Joint Optimization Algorithm Design
Due to the binary constraint (C11) and the non-convex constraints (C6, C7, C14–C16), it is difficult to solve the formulated mixed-integer non-convex problem P1 effectively and directly. Hence, we first propose a DRL-based algorithm to preliminarily determine the UAV trajectories, mission mode, and transmit power of the buoys. Then, we design an SMP-based association algorithm (SAA) and a UAV–OBS association algorithm (UAA) to solve the buoy–UAV and UAV–OBS association subproblems, respectively.
3.1. MAHPPO Algorithm
As is well known, RL is based on the continuous interaction between the agent and the environment, selecting the best action for the observed state so as to maximize the cumulative reward [18]. Each UAV in the multi-UAV system is controlled by a dedicated agent, and all agents are deployed at the OBS. Each agent receives the state information obtained by the UAVs from the environment and then selects an action to obtain an immediate reward. Specifically, the agents send all the action information to the UAVs, and then, the UAVs forward the action information to the buoys. The action space, state space, and reward function are designed as follows.
3.1.1. Action Space
The action space is denoted as $\mathcal{A}=\{a_u[k],\forall u\in\mathcal{U}\}$. $a_u[k]=\{m_u[k],p^{\mathrm{b}}_u[k],v_u[k],\theta_u[k]\}$ represents the action of each agent, where $m_u[k]\in\{0,1\}$ denotes the mission mode selection of UAV u. If $m_u[k]=0$, the agent chooses to make UAV u collect data at time slot k. Similarly, if $m_u[k]=1$, UAV u offloads the data at time slot k. $p^{\mathrm{b}}_u[k]$ represents the transmit power of the buoy corresponding to UAV u at time slot k. Note that the value of $p^{\mathrm{b}}_u[k]$ is obtained by the agent's action selection, but which buoy is associated with UAV u at time slot k is not known at this stage.
3.1.2. State Space
The state space is denoted as the global state $s[k]$, where the meanings of each variable are as follows:
$\mathbf{o}[k]$ is defined as the associated state of the buoy–UAV and UAV–OBS, where $o_u[k]$ represents the index of the buoy or OBS associated with UAV u. If UAV u is not associated with any buoy or the OBS, it yields $o_u[k]=0$.
$\{D_m[k]\}$ and $\{D_u[k]\}$ represent the sets of remaining data volume for the buoys and UAVs, respectively.
$\{E_m[k]\}$ and $\{E_u[k]\}$ represent the sets of remaining energy for the buoys and UAVs, respectively.
$\{x_u[k]\}$ and $\{y_u[k]\}$ represent the sets of horizontal coordinates of the UAVs.
In MAHPPO, the agents' action selection is not limited by the constraints, such as the SNR threshold, energy consumption, and so on. Therefore, it is necessary to make a series of constraint judgments on the action choices of the agents. If the selected action of an agent violates a constraint, the action is changed to meet the constraint conditions. Therefore, we define $\hat{m}_u[k]\in\{0,1,2\}$ as the changed mission mode selection of the UAVs, where $\hat{m}_u[k]=2$ means that UAV u neither collects nor offloads data.
$\{B_u[k]\}$, where $B_u[k]$ indicates the bandwidth occupied by UAV u at time slot k.
3.1.3. Reward Function
The following cases are possible during the multi-UAV mission execution process:
- (1)
If the mission is not completed, the UAV with sufficient energy will continue to collect or offload data.
- (2)
Although the mission is not completed, the UAV does not have sufficient energy. Hence, the UAV is forced to stop performing the mission at this time.
- (3)
The mission is successfully completed. Note that the UAVs must be offloading data in the last time slot.
Therefore, we designed the reward function (18), which combines a term positively related to the per-slot data transmission rate with an energy penalty $r^{\mathrm{e}}$, a position penalty $r^{\mathrm{p}}$ imposed when UAV u violates constraints C8–C10, and a time reward $r^{\mathrm{t}}$. Furthermore, the UAV mission completion time is upper-bounded by $K_{\max}\delta$. If the UAVs complete the mission ahead of time, the time reward $r^{\mathrm{t}}$ is granted.
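As a hedged illustration of this design (the exact form of Equation (18) is not reproduced here), a per-slot reward could be assembled as follows; all weights are placeholders, not the paper's values.

```python
def reward(rate_sum, energy_violation, position_violation, done_early,
           w_rate=1.0, r_e=-10.0, r_p=-10.0, r_t=50.0):
    """Illustrative per-slot reward in the spirit of Equation (18):
    a rate-proportional term plus energy/position penalties and a
    completion-time bonus."""
    r = w_rate * rate_sum
    if energy_violation:
        r += r_e        # energy penalty (case 2 above)
    if position_violation:
        r += r_p        # position penalty for violating C8-C10
    if done_early:
        r += r_t        # time reward for finishing ahead of the deadline
    return r
```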
According to the reward function (18), P1 can be rewritten as the cumulative reward maximization problem P2.
Since the action space contains two kinds of actions, a discrete variable $m_u[k]$ and three continuous variables $p^{\mathrm{b}}_u[k]$, $v_u[k]$, and $\theta_u[k]$, the MAHPPO algorithm [26,29], which can handle a hybrid discrete and continuous action space, is employed. As shown in Figure 2, agents send the state information $s[k]$ to the actor networks, and the number of actor networks is U. These actor networks provide the discrete and continuous action policies for the corresponding agents, where subscripts d and c represent the discrete and continuous parts of the actions, respectively, and $\theta_u$ is the parameter of the actor network for agent u. For convenience of expression, let $\pi_{\theta_u}$ denote either the discrete or the continuous strategy. In an actor network, the discrete and continuous output branches share multiple neural network layers. Since the MAHPPO algorithm adopts a stochastic policy, the discrete part of the actor network outputs the probability of each discrete action, and the discrete action is sampled randomly based on the Softmax distribution. The continuous part of the actor network follows a parameterized Gaussian distribution and outputs the mean and variance of the continuous actions; it adopts the ReLU function in the last layer of the neural network.
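The following PyTorch sketch shows one way to realize such a hybrid actor with a shared trunk, a Softmax (Categorical) head for the mission mode, and a Gaussian head for the continuous actions; the layer sizes and the learned log-std parameterization are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HybridActor(nn.Module):
    """Shared trunk with a discrete head (mission mode) and a Gaussian
    head (buoy power, velocity, angle). Sizes are illustrative."""

    def __init__(self, state_dim, n_modes=2, n_cont=3, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mode_head = nn.Linear(hidden, n_modes)       # discrete logits
        self.mu_head = nn.Linear(hidden, n_cont)          # continuous means
        self.log_std = nn.Parameter(torch.zeros(n_cont))  # learned std

    def forward(self, state):
        h = self.trunk(state)
        mode_dist = torch.distributions.Categorical(logits=self.mode_head(h))
        cont_dist = torch.distributions.Normal(self.mu_head(h),
                                               self.log_std.exp())
        return mode_dist, cont_dist

# Sampling a hybrid action:
# mode_dist, cont_dist = actor(state)
# m = mode_dist.sample(); a_c = cont_dist.sample()
```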
RL is a process in which an agent constantly interacts with the environment to find the optimal policy that maximizes the cumulative discounted reward $R[k]=\sum_{t=0}^{\infty}\gamma^{t}r[k+t]$, where $\gamma\in[0,1)$ is a discount factor. The critic network outputs the state value function $V_{\phi}(s[k])$, where the expectation of the return is taken with respect to the policy $\pi_{\theta}$. The critic network is updated by minimizing the loss function $L(\phi)=\mathbb{E}\left[\left(R[k]-V_{\phi}(s[k])\right)^{2}\right]$, where $\phi$ is the critic network parameter.
In order to prevent excessively large policy updates of the actor networks, a clipped surrogate loss function was adopted [30], which is expressed as
$$L^{\mathrm{clip}}(\theta_u)=\mathbb{E}\left[\min\left(\rho_u[k]\hat{A}[k],\ \mathrm{clip}\left(\rho_u[k],\,1-\epsilon,\,1+\epsilon\right)\hat{A}[k]\right)\right],$$
where $\rho_u[k]=\pi_{\theta_u}(a_u[k]\mid s[k])/\pi_{\theta^{\mathrm{old}}_u}(a_u[k]\mid s[k])$ is the ratio of the current policy to the old policy, $\pi_{\theta^{\mathrm{old}}_u}$ represents the old policy, and $\epsilon$ is a hyperparameter. $\hat{A}[k]$ is the advantage function estimated by the critic network, which is used to measure the advantage of action $a_u[k]$ in state $s[k]$ compared with the average action; it is computed from the temporal-difference errors of the state values estimated by the critic network.
In order to avoid falling into a suboptimal policy and to encourage the agents to explore, an entropy coefficient $\beta$ is introduced into the loss function of the actor networks. Then, the loss function of the actor network is expressed as
$$L(\theta_u)=L^{\mathrm{clip}}(\theta_u)+\beta\,\mathbb{E}\left[H\!\left(\pi_{\theta_u}(\cdot\mid s[k])\right)\right],$$
where $\beta$ is a hyperparameter.
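A compact PyTorch sketch of this clipped-surrogate-plus-entropy objective is shown below; it treats one action branch, and in HPPO the same form would be applied to the discrete and continuous branches and summed, which is our reading of the algorithm.

```python
import torch

def hppo_loss(logp_new, logp_old, adv, entropy, eps=0.2, beta=0.01):
    """Clipped surrogate loss with an entropy bonus for one action branch.

    logp_new/logp_old: log-probabilities of the taken actions under the
    current and old policies; adv: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)                 # rho_u[k]
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(ratio * adv, clipped)
    # Negate because optimizers minimize; entropy encourages exploration.
    return -(surrogate + beta * entropy).mean()
```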
The detailed flow of the MAHPPO algorithm is shown in Algorithm 1. In order to enhance the robustness of the algorithm, we normalized the states and actions to a common range. Moreover, the reward scaling technique was used to improve the algorithm performance [31].
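One common implementation of reward scaling divides each reward by the running standard deviation of the discounted return; the sketch below follows that convention, which we assume matches the technique cited from [31].

```python
import numpy as np

class RewardScaler:
    """Running-std reward scaling: divide each reward by the standard
    deviation of the discounted returns observed so far."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret, self.count, self.mean, self.m2 = 0.0, 0, 0.0, 0.0

    def __call__(self, r):
        self.ret = self.gamma * self.ret + r
        self.count += 1
        d = self.ret - self.mean            # Welford's online variance
        self.mean += d / self.count
        self.m2 += d * (self.ret - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return r / std
```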
3.2. SMP-Based Buoy–UAV Association Algorithm
Although the designed MAHPPO algorithm can determine the UAV mission modes, the association relationship between the buoys and UAVs is still unknown. In this subsection, we focus on this association problem. Given the transmit power of the buoys and the flight velocity and flight angle of the UAVs, P1 can be further reduced to the buoy–UAV association subproblem P3. Since P3 is a mixed-integer non-convex problem, the SMP method is used to obtain the optimal solution in an iterative manner.
The SMP was first proposed in [32]; it contains two groups of elements, called the male set and the female set. Everyone in each set has a preference list over the opposite set. According to the preference lists, a matching is considered stable if there exist no male and female in two different couples who both rate each other higher than their current partners. The Gale–Shapley (GS) algorithm effectively solves this problem. The main idea of the GS algorithm is to take one side's point of view (either male or female) and let each male propose to females according to his preference list. A female temporarily accepts a proposal until a better male proposes to her, and then she updates her choice. Following this process, the two sets continuously cycle and update the selection until the matching is stable.
Algorithm 1 MAHPPO algorithm.
1: Initialize the actor network parameters and the critic network parameter.
2: Initialize the experience replay buffer with size D, the mini-batch size, the sample reuse time N, and the initial state.
3: for episode = 1 to the maximum number of episodes do
4:  Initialize the number of total mission completion time slots, the energy penalty, and the position penalty.
5:  while done is not True do
6:   Every agent samples an action from its current policy.
7:   if UAV u flies beyond the target area then
8:    Apply the position penalty, cancel the velocity and angle actions, and update the coordinates based on the current state.
9:   end if
10:   if a UAV does not have sufficient energy then
11:    Let done be True, and apply the energy penalty.
12:   end if
13:   Get the reward and the next state.
14:   Store the experience data in the replay buffer.
15:   k ← k + 1.
16:   if the UAVs complete the mission or k exceeds the upper bound then
17:    Let done be True.
18:   end if
19:   if the replay buffer is full then
20:    Compute the advantages and the state values.
21:    for epoch = 1 to N do
22:     Sample a mini-batch from the replay buffer.
23:     Compute the loss of each actor network by (21).
24:     Update each actor network by a gradient method.
25:     Compute the critic loss by (18).
26:     Update the critic network by a gradient method.
27:    end for
28:    Clear the experience replay buffer.
29:   end if
30:  end while
31: end for
In this paper, the buoy–UAV association problem can be regarded as a preference ordering problem, which is a standard SMP. The UAVs and buoys can be regarded as the males and females, respectively. In contrast to the traditional SMP, U and M may not be equal. At time slot k, the goal of minimizing the males' cost can be regarded as maximizing the total amount of data collected by the UAVs. The amount of collected data is positively correlated with the SNR, so the preference value can be represented by the SNR between the UAVs and buoys. According to the transmit power actions obtained by the MAHPPO algorithm, the transmit power matrix of the buoys at time slot k can be constructed, and the preference matrix can then be expressed as $\mathbf{\Gamma}[k]=\left[\gamma_{m,u}[k]\right]_{M\times U}$.
It would be unfair to determine the association directly according to the above preference matrix. For example, suppose the distance between UAV u and buoy m is larger than that between UAV $u'$ and buoy m, but agent u chooses a higher transmit power so that $\gamma_{m,u}[k]>\gamma_{m,u'}[k]$; this causes buoy m to tend to be associated with UAV u and consume more of its own energy. Hence, the transmit power of all buoys was set to be the same.
Since the distances between the buoys and UAV u are all different, no two elements in any row of the preference matrix are equal. For buoy m, the distances between the UAVs and buoy m are also different, so no two elements in any column are equal, except in degenerate cases. Hence, the maximum SNR, defined as $\gamma_{\max}[k]$, is attained by a unique buoy–UAV pair in the preference matrix. Combined with the idea of the GS algorithm, we introduce Lemma 1.
Lemma 1. The buoy–UAV pair attaining $\gamma_{\max}[k]$ in the preference matrix is the best choice for each other.
Proof of Lemma 1. Suppose U = 2 and M = 3, and let the preference lists of the two UAVs and the three buoys be induced by the SNR values. Both UAV 1 and UAV 2 first choose Buoy 1 according to their own preferences, but Buoy 1 is selected to be associated with UAV 2 because UAV 2 provides the larger SNR. The corresponding row and column of the preference matrix are then deleted to obtain a reduced matrix. UAV 1 chooses the maximum value in the reduced matrix, which means it chooses to be associated with Buoy 2, and so on. □
The same result can also be obtained by taking the buoys' cost as the objective. Hence, the SAA achieves a stable matching while obtaining the globally optimal solution; it is shown in Algorithm 2.
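The greedy matching implied by Lemma 1 (repeatedly take the largest SNR entry and delete its row and column) can be sketched as follows; constraint filtering (C3, C6) is assumed to have already zeroed infeasible entries, as in Algorithm 2, and the interface is illustrative.

```python
import numpy as np

def saa_matching(snr_matrix):
    """Greedy matching implied by Lemma 1: repeatedly pick the largest SNR
    entry, associate that UAV-buoy pair, and delete its row and column.
    snr_matrix has shape (U, M); returns {uav_index: buoy_index}."""
    snr = np.array(snr_matrix, dtype=float)
    match = {}
    while np.any(snr > 0):
        u, m = np.unravel_index(np.argmax(snr), snr.shape)
        match[int(u)] = int(m)
        snr[u, :] = 0.0   # UAV u is now associated
        snr[:, m] = 0.0   # buoy m is now served
    return match
```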
3.3. Joint Optimization Algorithm
After obtaining the association relationship between the buoys and UAVs, we aim to determine the UAV–OBS association relationship. The UAV–OBS association subproblem with the given buoy–UAV associations, transmit power, flight velocity, and flight angle is formulated as P4. Although the agents choose to offload data, they cannot determine whether the actions satisfy the relevant constraints of subproblem P4. Hence, a heuristic UAV–OBS association algorithm is proposed to guarantee that the UAVs' actions satisfy these constraints. The detailed procedure is shown in Algorithm 3. In particular, lines 3–10 determine whether the current action selections of the UAVs satisfy the relevant constraints and update the action selections accordingly. In line 5, the bandwidth occupied by each UAV is not available due to the uncertainty of the UAV–OBS association. Hence, we take the maximum value B satisfying constraints C15 and C16 when solving problem P4.
Algorithm 2 SMP-based buoy–UAV association algorithm (SAA).
1: Input the collection group, the buoy set, the transmit power actions, the remaining data volumes, the remaining energy, and the current channel gains.
2: Set the initial association indicators to 0.
3: for each proposal round do
4:  Determine the common buoy transmit power for this round.
5:  Let the transmit power of all buoys be the same, and delete fully collected buoys from the candidate set.
6:  Obtain the preference matrix according to the transmit power and the channel gains.
7:  Obtain the energy consumption of the UAVs.
8:  Obtain the energy consumption of all buoys.
9:  for m = 1 to M do
10:   if buoy m has no remaining data, or its link does not satisfy C3 and C6, then
11:    Set the corresponding entries to 0, so as to obtain the new preference matrix.
12:   end if
13:  end for
14:  for each UAV in the collection group do
15:   Obtain the maximum SNR in the preference matrix.
16:   if the maximum SNR is greater than 0 then
17:    Obtain the UAV number u and buoy number m corresponding to the maximum SNR, and associate them.
18:    Delete the UAV number u from the collection group, and set the association indicator to 1.
19:    Set row u and column m of the preference matrix equal to 0, and then update it.
20:   else
21:    break.
22:   end if
23:  end for
24:  if all UAVs in the collection group are associated then
25:   break.
26:  end if
27: end for
28: Let a set indicate the UAVs that are first transformed.
29: if UAV u is not associated with any buoy then
30:  Transform UAV u to the offloading mode, and add the number of UAV u to the transformed set.
31: end if
32: if UAV u was already transformed once and still violates the constraints then
33:  Set the changed mission mode so that UAV u neither collects nor offloads, and clear its association. Then, add the number of UAV u to the transformed set.
34: end if
35: return the association indicators, the changed mission modes, and the transformed set.
In summary, the problem P1 can be solved effectively and efficiently with the proposed SAA, UAA, and MAHPPO algorithms. The detailed process is shown in Algorithm 4. Specifically, we first obtain the mission mode selections according to the actions of the agents at time slot k in Algorithm 1. According to the mission modes, we divide the UAVs into two initial groups: the collection group and the offloading group. In line 8, we perform Algorithm 3 to judge the offloading group. In lines 11–24 of Algorithm 3, a UAV whose action does not satisfy the constraints is transformed to the collection group, and its number is placed in a new set of transformed UAVs. In lines 9–10, we update the collection group with this set. Then, we perform Algorithm 2 to judge the collection group. Some UAVs may not satisfy the constraints of either data collection or data offloading after the two judgments. Hence, in lines 32–34 of Algorithm 2, we set the changed mission mode of such a UAV to the idle mode if it has been transformed once and still violates the constraints. In lines 12–13, for UAVs that have only experienced the judgment of Algorithm 2 once but did not satisfy the constraints, we perform Algorithm 3 to make the second judgment. In lines 22–24 of Algorithm 3, we likewise set the changed mission mode to the idle mode if the UAV has been transformed once and still violates the constraints. Hence, there may be some UAVs that neither collect nor offload data in a certain time slot. After the above three judgments, the association relationships of all UAVs can be determined. The above procedure is looped until done is True.
Algorithm 3 UAV–OBS association algorithm (UAA).
1: Input the offloading group, the UAV states, the remaining data volumes, the remaining energy, the bandwidth, and the current channel gain.
2: repeat
3:  Obtain the SNR of the UAV–OBS link according to the channel gain.
4:  Obtain the maximum transmission rate of the UAV according to B.
5:  Obtain the energy consumption of the UAV.
6:  if the SNR, energy consumption, and offloaded data volume satisfy C4, C6, and C15, respectively, then
7:   Associate the UAV with the OBS and keep the offloading mode.
8:  else
9:   Transform the UAV to the collection group.
10:  end if
11:  if C15 is not satisfied and the data of all buoys are fully collected then
12:   if the SNR and energy consumption satisfy C4 and C6 then
13:    Associate the UAV with the OBS and keep the offloading mode.
14:   else
15:    Set the changed mission mode so that the UAV neither collects nor offloads.
16:   end if
17:  end if
18:  Move to the next UAV in the offloading group.
19: until all UAVs in the offloading group have been judged
20: Let a set indicate the UAVs that are first transformed.
21: if UAV u was already transformed once and still violates the constraints then
22:  Set the changed mission mode so that UAV u neither collects nor offloads, and clear its association.
23:  Add the number of UAV u to the transformed set.
24: end if
25: return the association indicators, the changed mission modes, and the transformed set.
Algorithm 4 Joint optimization algorithm based on SAA, UAA, and MAHPPO (SU-MAHPPO).
Input: The initial positions of the UAVs, the positions of the buoys, and the position of the OBS.
Output: The UAV trajectories, mission modes, and association relationships.
1: /* In Algorithm 1 */
2: for episode = 1 to the maximum number of episodes do
3:  while done is not True do
4:   Perform lines 6–9 of Algorithm 1.
5:   Obtain the mission modes according to the agents' actions.
6:   Obtain the initial collection group and offloading group according to the mission modes.
7:   Initialize the changed mission modes and the transformed set.
8:   Update the association indicators, the changed mission modes, and the transformed set by performing Algorithm 3.
9:   if the transformed set is not null or UAVs exist in the collection group then
10:    Merge the transformed UAVs into the collection group.
11:    Update the association indicators, the changed mission modes, and the transformed set by performing Algorithm 2.
12:    if the transformed set is not null then
13:     Merge the transformed UAVs into the offloading group.
14:     Update the association indicators, the changed mission modes, and the transformed set by performing Algorithm 3.
15:    end if
16:   end if
17:   Perform lines 10–26 of Algorithm 1.
18:  end while
19: end for
3.4. Complexity Analysis
The complexity of a neural network usually depends on the state dimension of the input layer, the number of neural network layers, and the number of neurons in the hidden and output layers. Since the MAHPPO algorithm contains U actor networks and one critic network, the complexity of the MAHPPO algorithm can be expressed as $\mathcal{O}\big(U\sum_{l=1}^{L_a-1}n^a_l n^a_{l+1}+\sum_{l=1}^{L_c-1}n^c_l n^c_{l+1}\big)$, where $L_a$ and $L_c$ represent the number of neural network layers of the actor and critic networks, respectively, and $n^a_l$ and $n^c_l$ represent the number of neurons in the l-th layer of the actor and critic networks, respectively. Since the maximum sizes of the collection and offloading groups are both U, the worst-case complexities of Algorithm 2 and Algorithm 3 are $\mathcal{O}(UM)$ and $\mathcal{O}(U)$, respectively. Obviously, the complexity of Algorithm 2 is higher than that of Algorithm 3. In order to reduce the complexity of Algorithm 4, Algorithm 3 is executed with priority so that Algorithm 2 only needs to be executed once. In summary, the computational complexity of Algorithm 4 is the sum of the above terms and is dominated by the neural network computations.
4. Simulation Results and Discussion
In the simulations, we set a 5000 m × 5000 m target area and a 1500 m × 1500 m no-fly zone.
buoys with a data volume of 10 Mbits and a maximum transmit power
dBm were randomly distributed in the target area. The horizontal coordinates of the OBS were
m. The initial horizontal coordinates of the UAVs with the flight height
m were
m,
m, and
m, respectively. The maximum flight velocity and angle of the UAVs were
and
, respectively. The transmit power of each UAV was
W. The minimum safety distance between the UAVs was
m. The time slot duration was
s. For the MAHPPO algorithm, we set the length of the episode
,
. The actor network and the critic network both had three hidden layers with
neurons. The learning rates of the actor network and the critic network were
and
, respectively. The discount factor was
. The experience replay buffer size
D, mini-batch size
, and sample reuse time
N were set to 1024, 256, and 8, respectively. The hyperparameters were
and
. Moreover, our simulation was based on Python 3.8 with the PyTorch package. Other simulation parameters are shown in
Table 2.
Furthermore, we compared the proposed SU-MAHPPO algorithm with three baseline algorithms, which are listed as follows:
SU-MAPDQN algorithm: This algorithm refers to using the multi-agent PDQN (MAPDQN) algorithm to replace the MAHPPO algorithm in the algorithm proposed in our paper.
SU-MAPPO algorithm: This algorithm refers to using the multi-agent PPO (MAPPO) algorithm to replace the MAHPPO algorithm in the algorithm proposed in this paper. Furthermore, the continuous action space of the UAVs is discretized in this algorithm. The flight angle of UAV u at time slot k is expressed in four directions: up, down, left, and right, i.e., . The flight velocity of the UAV at time slot k is expressed as m/s.
MAHPPO algorithm: In this algorithm, the association relationships of the buoy–UAV and UAV–OBS are optimized directly by the agents rather than by the SAA and UAA. Hence, the action space of each agent additionally contains the discrete association choice, and the state space contains the discrete actions of all agents. The MAHPPO algorithm is shown in Algorithm 5.
Algorithm 5 MAHPPO algorithm.
1: /* In Algorithm 1 */
2: Perform lines 1–6 of Algorithm 1.
3: Obtain the association indicators according to the discrete actions of the agents.
4: Initialize the changed mission modes.
5: for u = 1 to U do
6:  for u' = 1 to U do
7:   if u ≠ u' and the two UAVs select the same association target then
8:    if the SNR of UAV u' is lower then
9:     Set the association indicator of UAV u' to 0.
10:    end if
11:   end if
12:   if the action does not satisfy C3, C4, C6, C7, and C15 then
13:    Set the changed mission mode so that the UAV neither collects nor offloads.
14:   end if
15:  end for
16: end for
17: Perform lines 7–31 of Algorithm 1.
As shown in Figure 3, we first compared the accumulative reward of SU-MAHPPO with the other three algorithms. The curves were smoothed for convenience of observation. We can see that the proposed SU-MAHPPO algorithm was significantly better than the other algorithms. The SU-MAHPPO scheme became stable and convergent after 10,000 episodes. Although the SU-MAPPO algorithm converges quickly, its accumulated reward was lower than that of the SU-MAHPPO algorithm, and its curve suddenly dropped after 15,000 episodes of training. Moreover, the SU-MAPDQN algorithm did not converge within 18,000 episodes. This is because MAPDQN outputs continuous action values for each value of the discrete action, which obviously increases the computational complexity. Furthermore, the performance of the MAHPPO algorithm was far lower than that of our proposed algorithm within 18,000 episodes. This is because the more complex action and state spaces make it more difficult for the agents to explore. This confirms the advantage of combining the SAA and UAA algorithms to narrow the exploration domain of the agents.
Figure 4 shows the effectiveness of the techniques mentioned in Section 3.1. It can be seen that the proposed SU-MAHPPO algorithm without reward scaling had difficulty converging and was more unstable within 18,000 episodes. The reason is that reward scaling is beneficial to the fitting of the value function. The curve of the SU-MAHPPO algorithm without state and action normalization shows an upward trend, but it also cannot converge within 18,000 episodes. This is because the order-of-magnitude differences between the parameters in the state and action spaces are large, which is not conducive to the training of neural networks. Hence, the above two techniques can effectively enhance the convergence and stability of the proposed algorithm.
Figure 5 shows the comparison of the UAV trajectories between the SU-MAHPPO algorithm and the SU-MAPPO algorithm. The mission completion times of SU-MAHPPO and SU-MAPPO were 53 s and 67 s, respectively. SU-MAHPPO performs better, since the designed reward function is positively related to the data transmission rate in each time slot. Moreover, the continuous action space of the SU-MAHPPO algorithm allows the agents to control the positions of the UAVs in each time slot more accurately, and likewise to adjust the transmit power of the buoys more precisely to maximize the accumulative reward. The discretized action space makes it difficult for the UAVs to reach the optimal trajectory for collecting and offloading data. Furthermore, it can be seen that the UAV trajectories of the proposed algorithm were not close to either the buoys or the OBS. This is because the UAVs can change the mission mode in different time slots. Take a trajectory close to a buoy as an example: the data collection time is reduced, but the UAV is far away from the OBS, which makes it difficult to meet the SNR threshold for data offloading. Hence, the UAV needs more time to offload the data, so the total mission completion time becomes longer.
Figure 6 shows the trajectories of three groups of UAVs taking off from different initial positions. The initial positions of the first group of UAVs are the same as the default positions, namely Position 1. The initial positions of the second group of UAVs are (1000, 1000), (1500, 2000), and (3000, 2000), namely Position 2. The initial positions of the third group of UAVs are (1500, 1500), (1500, 3000), and (3000, 1500), namely Position 3. It can be seen that the UAVs can avoid the no-fly zone even if they take off at its edge. This advantage benefits from the position penalty we set in the reward function, which guides the agents to keep the UAVs flying within the permitted target area.
Figure 7 shows the comparison of the multi-UAV trajectories based on the SU-MAPPO algorithm with different SNR thresholds of data collection. It can be seen that the UAV trajectories move closer to the buoys as the SNR threshold grows. This is because the designed reward function is mainly related to the transmission rate and the time reward. With increasing SNR thresholds, a UAV needs to be close enough to a buoy to achieve the minimum transmission rate requirement. The agent therefore lets the UAV fly closer to the buoy to enhance the data collection rate, which shortens the data collection time and yields a larger time reward.
Figure 8 shows the total mission completion time of the SU-MAHPPO algorithm and the SU-MAPPO algorithm with different SNR thresholds of data collection. Figure 9 and Figure 10 show the total mission completion time of the two algorithms under different channel conditions. It can be seen that the performance of our proposed algorithm is significantly better than that of the SU-MAPPO algorithm. The reason is that the discretized action space prevents the SU-MAPPO algorithm from exploring and selecting the optimal action in the considered environment, which further verifies the advantage of the proposed algorithm's continuous action space.