Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach

Liang, Ziyi; Dai, Yanpeng; Lyu, Ling; Lin, Bin

doi:10.3390/rs15020292

Open AccessArticle

Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach

by

Ziyi Liang

¹,

Yanpeng Dai

^1,*,

Ling Lyu

¹

and

Bin Lin

^1,2

¹

School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China

²

Peng Cheng Laboratory, Shenzhen 518052, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(2), 292; https://doi.org/10.3390/rs15020292

Submission received: 9 November 2022 / Revised: 21 December 2022 / Accepted: 29 December 2022 / Published: 4 January 2023

(This article belongs to the Special Issue Satellite and UAV for Internet of Things (IoT))

Download

Browse Figures

Versions Notes

Abstract

:

This paper studies the integration of data collection and offloading for maritime Internet of Things (IoT) systems with multiple unmanned aerial vehicles (UAVs). In the considered multi-UAV maritime IoT system, the UAVs act as the aerial base stations to complete the missions of data collection from buoys and data offloading to the offshore base station (OBS). In this case, the UAVs need to adaptively select the mission mode between data collection and data offloading according to the network resources and mission requirements. In this paper, we aimed to minimize the completion time of data collection and offloading missions for all UAVs by jointly optimizing the UAV trajectories, mission mode selection, transmit power of buoys, and association relationships between the UAVs and buoy/OBS. In order to solve the mixed-integer non-convex minimization problem, we first designed a multi-agent deep reinforcement learning algorithm based on a hybrid discrete and continuous action space to preliminarily obtain the UAV trajectories, mission mode selection, and the transmit power of buoys. Then, we propose an algorithm based on the stable marriage problem to determine the buoy–UAV and UAV–OBS association relationships. Finally, the simulation results show that the proposed algorithms can effectively shorten the total mission completion time of data collection and offloading for the multi-UAV-assisted maritime IoT system.

Keywords:

maritime IoT; deep reinforcement learning; UAV trajectory optimization; hybrid proximal policy optimization

Graphical Abstract

1. Introduction

The development of the maritime Internet of Things (IoT) facilitates various maritime activities of humans more frequently and greatly promotes the development of the maritime economy. The surging demand for wireless communication puts great pressure on maritime wireless communication systems for the next generation of wireless communication technologies (such as beyond Fifth-Generation (5G) and Six-Generation (6G)) [1,2]. However, maritime wireless communication technology is relatively behind compared with that for land [3]. For example, the maritime wireless communication system lacks the construction conditions of communication facilities (such as base stations), and the maritime spectrum resources are scarce [4]. Besides, complex sea surface changes (such as wave motion and evaporation duct) make the maritime wireless communication environment unstable [5]. Hence, it is of great significance to research the maritime IoT to realize flexible, efficient, and reliable transmission.

In the maritime IoT system, buoys embedded with a variety of sensors are usually deployed on the sea surface to monitor the maritime environment. However, these buoys are generally out of the service coverage of the offshore base stations (OBSs). Although monitoring data can be transmitted through the multi-hop network formed by buoys and ships, the connectivity of the network will be seriously damaged when nodes in the network fail [6]. Satellite communication could mitigate the issue of network failure, but it has fatal drawbacks in terms of high delay, low-link reliability, and high cost [7]. The unmanned-aerial-vehicle (UAV)-assisted maritime communication system is regarded as a promising solution to deal with the high delay and low reliability. The UAV as the aerial mobile base station has the following advantages in assisting maritime wireless communication. On the one hand, the high maneuverability of the UAV makes it easier to establish line-of-sight (LoS) communication link between UAVs and buoys and also allows them to flexibly approach buoys to enhance the communication links [8]. Hence, the UAV can better adapt to the complex maritime wireless communication environment and is very suitable for emergency communication scenarios. On the other hand, the UAV usually has the advantages of a small size, low cost, and easy control [9]. The application of UAVs can reduce the risk of manual operation in some dangerous maritime environments. The above advantages also make UAVs have high concealment and be widely used in maritime military activities.

However, the size of the batteries embedded in UAVs and buoys is limited. It will affect the efficiency of UAV’s mission execution, if UAVs frequently return to the charging area. Moreover, frequent battery replacement will increase the maintenance cost of buoys, and in fact, it difficult to achieve due to the bad condition of the sea environment. Hence, the performance of UAV-assisted maritime wireless communication is severely limited by the energy consumption of UAVs and buoys [10]. The energy consumption disadvantage could be compensated by multi-UAV cooperative data transmission. The multi-UAV communication system has a wider communication range and a shorter transmission distance, which can reduce the communication delay and improve the energy efficiency [11]. In the multi-UAV communication system, UAVs need to collect the data cached in buoys and offload the data to the OBS. In this process, the UAV will be far away from the OBS, which may make it take a long time to offload the data. Hence, it is important to design the trajectories of multiple UAVs to ensure the data transmission requirement under the condition of limited energy and shorten the mission completion time of data collection and data offloading.

Recently, there has been extensive research work focused on the multi-UAV-assisted wireless communication scenarios. Diao et al. [12] solved the multi-UAV deployment problem by a circular deployment scheme and jointly optimized the offloading and scheduling strategies to minimize the weighted sum of transmission and hover energy consumption. Kuo et al. [13] studied multi-UAV-assisted data collection scenarios with circular trajectories and considered the deployment of UAVs and device association to minimize the total energy consumption of IoT devices. Gao et al. [14] jointly optimized UAV trajectories, ground equipment scheduling, and power allocation by using the technologies of block coordinate descent and successive convex approximation to maximize the overall throughput of ground devices. However, the above work only considered the uplink or downlink transmission scenarios. Hua et al. [15] studied the simultaneous uplink and downlink transmission systems of multiple UAVs, in which one UAV is responsible for downlink transmission and the other UAV is responsible for uplink transmission. Furthermore, Liu et al. [16] studied that each UAV can dynamically change the transmission mode in each time slot, but the premise of this work is to know the UAVs’ trajectories. It is significant that UAVs can change the mission mode according to the transmission requirements. For example, a UAV adopts the mode of offloading after completing the collection mission. When the UAV fails during the collection mission, all the data collected by the UAV cannot be returned to the OBS, resulting in data loss. The high coupling of UAV trajectories, association relationships, and mission mode is a challenge for us, which can usually be described as a mixed-integer non-convex problem. The above works effectively solved this kind of problem through traditional convex optimization techniques and iterative methods, but they have high computational complexity and the curse of dimensionalitywith the increase of the action and state dimension of the system due to the high dynamics of UAVs.

With the development of machine learning technology, deep reinforcement learning (DRL) algorithms have received extensive attention with the advantage of dealing with highly dynamic environments [17,18,19]. In reinforcement learning, the agent maximizes the cumulative reward by constantly exploring the environment. DRL combines the advantages of the deep neural network (DNN) and can handle more complex actions and states. At present, much research has applied DRL algorithms to multi-UAV trajectory optimization. References [20,21,22] studied the multi-UAV trajectory optimization scheme based on the multi-agent deep deterministic policy gradient (MADDPG) algorithm and regarded the UAVs as the agents. Wu et al. [20] aimed to minimize the average age-of-information (AoI) and proposed an MADDPG algorithm to jointly optimize the UAV sensing location and transmission location. However, the MADDPG algorithm can only handle a continuous action space. Hence, Wang et al. [21] combined a low-complexity mission decision-making method with the MADDPG algorithm. Gao et al. [22] used a potential game approach to determine the service allocation between users and multiple UAVs, which effectively solved the mixed-integer non-linear problem. For using DRL algorithms to process the hybrid discrete and continuous action space, it undoubtedly reduces the flight accuracy of UAVs if a continuous action is discretized, and this is inconsistent with the actual situation of UAV flight actions; relaxing the discrete action space into a continuous action space increases the complexity of the action space, such as in Hausknecht et al. [23]. Xiong et al. [24] proposed the parameterized deep Q-network learning (PDQN) algorithm based on the parameterized action space. Yin et al. [25] proposed a multi-agent parametrized deep Q-network (MAPDQN) algorithm to maximize both the overall throughput and the fairness throughput. However, the action decision-making method of the PDQN algorithm may reduce the maximum Q-value [26]. Then, Fan et al. [26] proposed a hybrid proximal policy optimization (HPPO) algorithm, which can efficiently solve the hybrid action space based on the proximal policy optimization (PPO) algorithm. We summarize the difference between our work and the existing related references in Table 1.

Based on the above discussion, this paper studies the multi-UAV-assisted maritime IoT systems, in which the missions of the UAVs consist of collecting data from buoys and offloading the data to the OBS. The UAVs could adaptively change the mission mode according to the network resources and mission requirements. Our goal was to minimize the total mission completion time by jointly optimizing the UAV trajectories, mission mode selection, the transmit power of the buoys, and the buoy–UAV and UAV–OBS association relationships with the constraints of energy consumption and a no-fly zone. The main contributions of our paper are summarized as follows:

We propose an adaptive data transmission mission mode framework for the multi-UAV-assisted maritime IoT system. Specifically, UAVs are allowed to change the mission mode in each time slot according to the channel quality and data volume, which improves the flexibility of offshore data collection and offloading.
A multi-UAV trajectory optimization algorithm based on multi-agent HPPO (MAHPPO) is proposed to overcome the problem of a hybrid discrete and continuous action space. The total mission completion time is effectively shortened through the flexible scheduling of the UAV trajectories, mission mode, and transmit power of buoys.
In order to reduce the exploration difficulty and action space dimension of the MAHPPO algorithm, we propose an algorithm based on the stable marriage problem (SMP) to determine the buoy–UAV and UAV–OBS association relationships. The simulation results show that the proposed algorithm can effectively reduce the total mission completion time with low computational complexity.

The remainder of the paper is organized as follows. The multi-UAV data collection and offloading model and the problem formulation are given in Section 2. Then, the problem analysis and the proposed algorithm are given in Section 3. The simulation results are given in Section 4. Finally, we give a conclusion in Section 5.

2. System Model and Problem Formulation

As shown in Figure 1, we considered a UAV-assisted maritime IoT system where U UAVs with the same fixed flight height H were used as the aerial base stations to execute the data collection–offloading mission in the target area. The mission of the UAVs is to collect the hydrometeorological data sensed by M buoys with a random distribution in the target area and offload all the collected data to the OBS. The sets of UAVs and buoys are denoted as

u \in U = \{1, 2, \dots, U\}

and

m \in M = \{1, 2, \dots, M\}

, respectively. The total mission completion time of the UAVs is divided into multiple equal time slots and is denoted as

T_{total} = K δ

, where K is the number of time slots,

k \in K = \{\infty, \in, \dots, K\}

is the set of time slots, and

δ

is the duration of each time slot. Note that, at each time slot, the UAVs need to choose the appropriate operation mode between data collection and data offloading according to the current network states. In each time slot, the UAV u cannot collect and offload data at the same time.

The horizontal coordinate of UAV u at time slot k is denoted as

q_{u, k} = \{x_{u, k}, y_{u, k}\}, \forall u \in U, k \in K

. The flight direction of UAV u is limited by an angle of

ψ_{u, k} \in [0, 2 π)

and a velocity of

V_{u, k} \in [0, V_{max}]

, where

V_{max}

is the maximum flight velocity of the UAV. The coordinates of UAV u can be expressed as

x_{u, k} = x_{u, 0} + \sum_{k^{*} = 1}^{k} V_{u, k^{*}} δ cos (ψ_{u, k^{*}})

and

y_{u, k} = y_{u, 0} + \sum_{k^{*} = 1}^{k} V_{u, k^{*}} δ sin (ψ_{u, k^{*}})

, respectively. Considering the limitation of the flight range,

x_{u, k}

and

y_{u, k}

are respectively constrained by

0 \leq x_{u, k} \leq x_{max}, \forall u \in U, k \in K

(1)

0 \leq y_{u, k} \leq y_{max}, \forall u \in U, k \in K

(2)

where

x_{max}

and

y_{max}

are the length and width of the target area, respectively. Moreover, the distance between UAV u and UAV

u^{*}

at == time slot k can be expressed as

d_{u, u^{*}, k} = \sqrt{{(x_{u, k} - x_{u^{*}, k})}^{2} + {(y_{u, k} - y_{u^{*}, k})}^{2}}

. In order to avoid a collision, it is necessary to ensure the minimum safety distance

d^{min}

between UAVs, which is expressed as

d_{u, u^{*}, k} \geq d^{min} . \forall u, u^{*} \in U, u \neq u^{*}, k \in K

(3)

Denote the set

i \in I = \{0, 1, \dots, M\}

as the set of OBS and buoys, where

i = 0

means the OBS. The horizontal coordinate of buoy m and the OBS is

(x_{i}, y_{i})

. Hence, the distance of the buoy–UAV and UAV–OBS at time slot k can be given by

d_{u, i, k} = \sqrt{{(x_{u, k} - x_{i})}^{2} + {(y_{u, k} - y_{i})}^{2} + H^{2}}

.

2.1. Channel Model

Due to the special maritime propagation, we adopted the model of the air-to-ground channel and the two-ray path loss model [27]. More specifically, the probability of the LoS link is expressed as

P_{u, i, k}^{LoS} = \frac{1}{1 + a exp (- b (ψ_{u, i, k} - a))}

, where a and b are two constant values depending on the environment.

ψ_{u, i, k}

denotes the elevation angle of the buoy–UAV and UAV–OBS, which is given by

ψ_{u, i, k} = \frac{18 0^{^{\circ}}}{π} \times arcsin (\frac{H}{\sqrt{{(x_{u, k} - x_{i})}^{2} + {(y_{u, k} - y_{i})}^{2}}})

.

The average path loss of the buoy–UAV and UAV–OBS links is figured out according to the probability of establishing the LoS, which can be expressed as

L_{u, i, k} = P_{u, i, k}^{LoS} L_{u, i, k}^{LoS} + (1 - P_{u, i, k}^{LoS}) L_{u, i, k}^{NLoS}, \forall u \in U, i \in I, k \in K

(4)

where

L_{u, i, k}^{LoS}

and

L_{u, i, k}^{NLoS}

are the average path loss of the LoS and NLoS, respectively, which can be given by

L_{u, i, k}^{LoS} = {(\frac{4 π d_{u, i, k}}{λ})}^{υ^{LoS}} μ_{u, i, k} ξ_{LoS},

(5)

L_{u, i, k}^{NLoS} = {(\frac{4 π d_{u, i, k}}{λ})}^{υ^{NLoS}} μ_{u, i, k} ξ_{NLoS},

(6)

where

μ_{i, k} = 1

,

λ

is the wavelength,

ξ_{LoS}

and

ξ_{NLoS}

are the excessive path loss for the LoS and NLoS paths, respectively, and

υ^{LoS}

and

υ^{NLoS}

are the LoS and NLoS link factors, respectively.

According to the above discussions, the channel gain of the buoy–UAV and UAV–OBS links can be expressed as

h_{u, i, k} = 1 / L_{u, i, k} . \forall u \in U, i \in I, k \in K

(7)

The signal-to-noise-ratio (SNR) of the buoy–UAV and UAV–OBS at time slot k is given by

g_{u, i, k} = \frac{P_{u, i, k} h_{u, i, k}}{σ^{2}}, \forall u \in U, i \in I, k \in K

(8)

where

σ^{2}

is the Gaussian noise power and

P_{u, m, k}, \forall m \in M

and

P_{u, 0, k}

are the transmit power of buoy m and UAV u at time slot k, respectively. Moreover,

0 \leq P_{u, m, k} \leq P_{max}

, and

P_{u, 0, k}

is a constant.

2.2. Transmission Model

Let

α_{u, i, k}, \forall u \in U, i \in I, k \in K

denote the association indicator of the buoy–UAV and UAV–OBS.

α_{u, m, k} = 1, \forall m \in M

if the buoy m is associated with UAV u at time slot k. Otherwise,

α_{u, m, k} = 0, \forall m \in M

. Similarly,

α_{u, 0, k} = 1

if UAV u is associated with the OBS at time slot k; otherwise,

α_{u, 0, k} = 0

. At time slot k, each UAV can only collect the data from one buoy, and each buoy can only be associated with one UAV at most.

α_{u, i, k} \in \{0, 1\}, \forall u \in U, i \in I, k \in K

(9)

\sum_{i = 0}^{M} α_{u, i, k} \leq 1, \forall u \in U, k \in K

(10)

\sum_{u = 1}^{U} α_{u, m, k} \leq 1 . \forall m \in M, k \in K

(11)

The system total spectrum bandwidth is B and allocated to each buoy–UAV link and UAV–OBS link equally. Each buoy–UAV link and UAV–OBS link uses mutually independent and non-overlapping spectrum resources. Then, the spectrum bandwidth of each buoy–UAV link and UAV–OBS link at time slot k can be given by

B_{k} = \frac{B}{{U_{k}}^{*}},

(12)

where

{U_{k}}^{*} \leq U

is the number of UAVs successfully associated with the buoys and OBS at time slot k.

The transmission rate of the buoy–UAV and UAV–OBS at time slot k is expressed as

R_{u, i, k} = B_{k} α_{u, i, k} {log}_{2} (1 + g_{u, i, k}) . \forall u \in U, i \in I, k \in K

(13)

In order to ensure that the data collection requirements of the buoys are satisfied, the following condition is considered:

\sum_{u = 1}^{U} \sum_{k = 1}^{K} R_{u, m, k} \geq C_{m}, \forall m \in M

(14)

where

C_{m}

denotes the data volume that needs to be collected from buoy m. It is necessary to guarantee that all data collected by the UAVs are offload to the OBS. Moreover, the UAV that collects enough data is allowed to offload the data at time slot k, even though the data in the buoys are not completely collected. The total data volume collected by UAV u before time slot k is expressed as

C_{u, k - 1}^{co} = \sum_{k^{*} = 1}^{k - 1} \sum_{m = 1}^{M} R_{u, m, k^{*}} δ

. The total data volume offload from UAV u to the OBS before time slot k is expressed as

C_{u, k - 1}^{of} = \sum_{k^{*} = 1}^{k - 1} R_{u, 0, k^{*}} δ

, where

k^{'}

denotes the time slot in which the data in all buoys are collected completely, and

k^{'} < K

. Hence, each UAV should satisfy the following constraints:

C_{u, k - 1}^{co} - C_{u, k - 1}^{of} \geq R_{u, 0, k} δ, \forall u \in U, k \in \{1, 2, \dots, k^{'}\}

(15)

\sum_{u = 1}^{U} \sum_{k = 1}^{K} R_{u, 0, k} δ \geq \sum_{m = 1}^{M} C_{m} .

(16)

In particular, even if UAV u does not have enough data, it can only offload the data to the OBS rather than at time slot

k^{'}

.

2.3. Energy Consumption Model

The energy consumption of the UAV mainly includes propulsion energy and communication energy [28]. The propulsion energy is to keep the UAV hovering and support the flight movement. The communication energy consumption is mainly caused by data transmission, signal processing, the communication circuit, etc. The order of magnitude of the propulsion energy is much larger than that of the communication energy. Hence, in this paper, we mainly considered the propulsion energy consumption and set the transmit power of the UAVs

P_{u, 0, k}

to be a constant. Then, the communication energy caused by the data transmission from the UAVs to the OBS is

E_{u}^{cm} = \sum_{k = 1}^{K} α_{u, 0, k} P_{u, 0, k} δ

.

The propulsion power of UAV u mainly depends on the flight velocity, given by

P_{u} (V_{u, k}) = P_{o} (1 + \frac{3 V_{u, k}^{2}}{U_{tip}^{2}}) + P_{i} {(\sqrt{1 + \frac{V_{u, k}^{4}}{4 v_{o}^{4}}} - \frac{V_{u, k}^{2}}{2 v_{o}^{2}})}^{1 / 2} + \frac{1}{2} d_{o} ρ l f V_{u, k}^{3},

(17)

where

U_{tip}

denotes the tip speed of the rotor blade,

v_{o}

is the mean rotor-induced velocity,

d_{o}

is the fuselage drag ratio,

P_{o}

and

P_{i}

are the blade profile power and induced power, respectively, and

ρ

, l, and f denote the air density, rotor solidity, and blade angular velocity, respectively. The propulsion energy consumption of UAV u can be expressed as

E_{u}^{pr} = \sum_{k = 1}^{K} P (V_{u, k}) δ .

(18)

Then, the total energy consumption of UAV u is given by

E_{u}^{total} = E_{u}^{cm} + E_{u}^{pr}

, where

E_{u}^{total} \leq {\bar{E}}_{u}

and

{\bar{E}}_{u}

is the energy threshold of UAV u.

The energy consumption of the buoys is mainly generated by data transmission, expressed as

E_{m} = \sum_{u = 1}^{U} \sum_{k = 1}^{K} α_{u, m, k} P_{u, m, k} δ

, where

E_{m} \leq {\bar{E}}_{m}

, and

{\bar{E}}_{m}

is the energy threshold of buoy m.

2.4. Problem Formulation

Let

A = {α_{u, i, k}, \forall u, i, k}

denote the associated variables,

P = {P_{u, m, k}, \forall u, m, k}

denote the transmit power of the buoys,

V = {V_{u, k}, \forall u, k}

denote the flight velocity of the UAVs, and

ψ = {ψ_{u, k}, \forall u, k}

denote the flight angle of the UAVs. Our goal is to minimize the completion time of the data collection–offloading missions for all UAVs by the joint optimization of

A

,

P

,

V

, and

ψ

. Then, the total mission completion time minimization problem can be formulated as

\begin{matrix} (P 1) : min_{A, P, V, ψ} & T_{total} \\ s . t . & C 1 : 0 \leq V_{u, k} \leq V_{max}, \forall u \in U, k \in K \\ C 2 : 0 \leq ψ_{u, k} < 2 π, \forall u \in U, k \in K \\ C 3 : g_{u, m, k} \geq {\bar{g}}_{co}, \forall u \in U, m \in M, k \in K \\ C 4 : g_{u, 0, k} \geq {\bar{g}}_{of}, \forall u \in U, k \in K \\ C 5 : 0 \leq P_{u, m, k} \leq P_{max}, \\ \forall u \in U, m \in M, k \in K \\ C 6 : E_{u}^{total} \leq {\bar{E}}_{u}, \forall u \in U \\ C 7 : E_{m} \leq {\bar{E}}_{m}, \forall m \in M \\ C 8 - C 16 : (1) - (3), (9) - (11), (14) - (16) . \end{matrix}

In problem P1, C1 and C2 are the UAV maximum velocity and maximum angle constraints, respectively. C3 and C4 are the SNR constraints, where

{\bar{g}}_{co}

and

{\bar{g}}_{of}

are the threshold of the SNR of the buoy–UAV and UAV–OBS, respectively. C5 restricts the maximum transmit power of the buoys. C6 and C7 mean that the energy consumption of the UAVs and buoys should be smaller than their maximum energy. C8-C10 ensure that the UAVs keep apart from each other to avoid collision and do not fly beyond the target area. C11–C13 are the association constraints. C14–C16 are the data transmission constraints of the process of UAV data collection and offloading.

3. Joint Optimization Algorithm Design

In this section, due to the binary constraint (C11) and the non-convex constraints (C6, C7, C14–C16), it is difficult to solve the formulated mixed-integer non-convex problem P1 effectively and directly. Hence, we first propose a DRL-based algorithm to preliminarily determine the UAV trajectories, mission mode, and transmit power of the buoys. Then, we designed an SMP-based association algorithm (SAA) and an UAV–OBS association algorithm (UAA) to solve the buoy–UAV and UAV–OBS association subproblem, respectively.

3.1. MAHPPO Algorithm

As we all known, RL is based on the continuous interaction between the agent and the environment to select the best action for the observed state so as to maximize the cumulative reward [18]. Each UAV in the multi-UAV system is controlled by a specialized agent, and all agents are deployed in the OBS. Each agent receives the status information obtained by the UAVs from the environment and then selects an action to obtain an immediate reward. Specifically, the agents send all the action information to the UAVs, and then, the UAVs forward the action information to the buoys. The action space

A

, state space

S

, and reward function

R

are designed as follows.

3.1.1. Action Space

The action space is denoted as

A = \{A_{u, k}, \forall u \in U\}

.

A_{u, k} = \{χ_{u, k}, ψ_{u, k}, V_{u, k}, P_{u, k}\}

represents the action space of each agent, where

χ_{u, k} = {0, 1}

is denoted as the mission mode selection of UAV u. If

χ_{u, k} = 0

, the agent chooses to make UAV u collect data at time slot k, meaning that

\sum_{m = 1}^{M} α_{u, m, k} = 1

. Similarly, if

χ_{u, k} = 1

, UAV u offloads the data at time slot k, meaning that

α_{u, 0, k} = 1

.

P_{u, k}

represents the transmit power of the buoy corresponding to UAV u at time slot k. Note that the value of

P_{u, k}

could be obtained by the agent action selection, but we do not know which buoy is associated to UAV u at time slot k.

3.1.2. State Space

The state space is denoted as the global state

S_{k} = \{κ_{k}, c_{k}^{buoy}, E_{k}^{buoy}, x_{k}, y_{k}, c_{k}^{uav}, E_{k}^{uav}, χ_{k}, χ_{k}^{new}, B_{k}\}

, where the meanings of each variable are as follows:

$κ_{k} = \{κ_{u, k}, \forall u \in U\}$ is defined as the associated state of the buoy–UAV and UAV–OBS, where $κ_{u, k} = i, \forall i \in I$ represents the number of buoys or OBS associated with UAV u. If UAV u is not associated with any buoy or OBS, it yields that $κ_{u, k} = - 1$ .
$c_{k}^{buoy} = \{c_{m, k}^{buoy}, \forall m \in M\}$ and $c_{k}^{uav} = \{c_{u, k}^{uav}, \forall u \in U\}$ represent the set of remaining data volume for the buoys and UAVs, respectively.
$E_{k}^{buoy} = \{E_{m, k}^{buoy}, \forall m \in M\}$ and $E_{k}^{uav} = \{E_{u, k}^{uav}, \forall u \in U\}$ represent the set of remaining energy for the buoys and UAVs, respectively.
$x_{k} = \{x_{u, k}, \forall u \in U\}$ and $y_{k} = \{y_{u, k}, \forall u \in U\}$ represent the set of horizontal coordinates of the UAVs.
In MAHPPO, the agent action selection is not limited by the constraints, such as the SNR threshold, energy consumption, and so on. Therefore, it is necessary to make a series of constraint judgments on the action choices of the agents. If the selected action of the agent violates the constraint, the action will be changed to meet the constraint conditions. Therefore, we define $χ_{k}^{new} = \{χ_{u, k}^{new}, \forall u \in U\}$ as the changed mission mode selection of the UAVs and $χ_{u, k}^{new} = \{χ_{u, k}, 2\}$ , where $χ_{u, k}^{new} = 2$ means that UAV u neither collects nor offloads data.
$B_{k} = \{B_{u, k}, \forall u \in U\}$ , where $B_{u, k}$ indicates the bandwidth occupied by UAV u at time slot k.

3.1.3. Reward Function

The following cases are possible during the multi-UAV mission performance process:

(1): If the mission is not completed, the UAV with sufficient energy will continue to collect or offload data.
(2): Although the mission is not completed, the UAV does not have sufficient energy. Hence, the UAV is forced to stop performing the mission at this time.
(3): The mission is successfully completed. Note that the UAVs must be offloading data in the last time slot.

Therefore, we designed the reward function as

\begin{matrix} r_{k} = \{\begin{matrix} \sum_{u = 1}^{U} (α_{u, 0, k} R_{u, 0, k} δ + ρ_{u, k}) + K^{*}, & Case 3 \\ \sum_{u = 1}^{U} (\sum_{i = 0}^{I} α_{u, i, k} R_{u, i, k} δ + ρ_{u, k}) + E_{pe}, & Case 1 & 2 \end{matrix} \end{matrix}

(19)

where

E_{pe}

is defined as the energy penalty.

ρ_{u, k}

is the position penalty when UAV u violates constraints C8-C10. Furthermore, the UAV mission completion time is upper-bounded by

K^{\max}

. If the UAVs complete the mission ahead of time, let

K^{*} = K^{max} - K

denote the time reward.

According to the reward function (18), P1 can be rewritten as

\begin{matrix} (P 2) : max_{A, P, V, ψ} & \sum_{k = 1}^{K} \sum_{i = 0}^{M} R_{u, i, k} \\ s . t . C 1 - C 16 . \end{matrix}

(20)

Since the action space contains two kinds of actions: a discrete variable

χ_{u, k}

and three continuous variables

ψ_{u, k}, V_{u, k}

, and

P_{u, k}

, the MAHPPO algorithm [26,29] that can solve the hybrid discrete and continuous action space is employed. As shown in Figure 2, agents send state information

S_{k}

to actor networks, and the number of actor networks is U. These actor networks provide action policies

π_{θ_{u}} (A_{d, k_{u}} |S_{k})

and

π_{θ_{u}} (A_{c, k_{u}} |S_{k})

for the corresponding agents, where subscripts d and c represent the discrete and continuous parts of the actions, respectively.

θ_{u}

is the parameter of the actor network for agent u. For the convenience of expression, let

π_{θ_{u}} (A_{k_{u}} | S_{k})

denote the discrete or continuous strategies. In an actor network, the output branches of

π_{θ_{u}} (A_{k_{u}} | S_{k})

share multiple neural network layers. Since the MAHPPO algorithm adopts a random strategy, the discrete part of the actor network outputs the probability value of each discrete action. Moreover, the discrete action is sampled randomly based on the Softmax distribution. The continuous part of the actor network follows the parameterized Gaussian distribution and outputs the mean and variance of the continuous actions. The continuous part adopts the rule function in the last layer of the neural network.

RL is a process in which an agent constantly interacts with the environment to find the optimal policy to maximize the cumulative discount reward

R_{k} = \sum_{τ = 0}^{\infty} {γ^{τ} r}_{k + 1 + τ}

, where

γ

is a discount factor. The critic network outputs the state value function

V_{ϕ} (S_{t}) = E_{π_{θ}} [R_{k} |S = S_{k}]

, where the expectation of

R_{k}

is obtained with the policy

π_{θ_{u}} (A_{k_{u}} | S_{k})

. The critic network is updated by minimizing the loss function, shown as

L (ϕ) = E [{(R_{k} - V_{ϕ} (S_{k}))}^{2}],

(21)

where

ϕ

is the critic network parameter.

In order to prevent excessive differences in the policy updates of the actor networks, a clipped surrogate loss function was adopted [30], which is expressed as

L^{CLIP} (θ_{u}) = E [min (r_{θ_{u}} {\hat{A}}_{k}), c l i p (r_{θ_{u}}, 1 - Δ, 1 + Δ) {\hat{A}}_{k}],

(22)

where

r_{θ_{u}} = \frac{π_{θ_{u}} (A_{k_{u}} | S_{k})}{π_{θ_{u}^{old}} (A_{k_{u}} | S_{k})}

is the ratio of the current policy to the old policy and

π_{θ_{o l d}} (A_{k} | S_{k})

represents the old policy.

Δ

is a hyperparameter.

{\hat{A}}_{k}

is the advantage function estimated by the critic network, which is used to measure the advantage of action

A_{k}

in state

S_{k}

compared with the average action. The detailed definition of

{\hat{A}}_{k}

is given by

{\hat{A}}_{k} = R_{k} - V_{ϕ} (S_{k}) .

(23)

In order to avoid falling into a suboptimal policy and encourage agents to explore, an entropy coefficient

H (π_{θ_{u}} (A_{k_{u}} | S_{k}))

is introduced into the loss function of the actor networks. Then, the loss function of the actor network is expressed as

L (θ_{u}) = \sum_{u = 1}^{U} [L^{CLIP} (θ_{u}) + ι E [H (π_{θ_{u}} (A_{k_{u}} | S_{k}))]],

(24)

where

ι

is a hyperparameter.

The detailed flow of the MAHPPO algorithm is shown in Algorithm 1. In order to enhance the robustness of the algorithm, we normalized

S

and

A

to the range

[- 1, 1]

. Moreover, the reward scaling technique was used for improving algorithm performance [31].

3.2. SMP-Based Buoy–UAV Association Algorithm

Although the designed MAHPPO algorithm can determine the UAV mission modes, the association relationship between the buoys and UAVs is still unknown. In this subsection, we focus on addressing the association relationship problem. Given the transmit power of the buoys and the flight velocity and flight angle of the UAVs, P1 can be further written as

\begin{matrix} (P 3) : max_{A} & \sum_{k = 1}^{K} \sum_{i = 0}^{M} R_{u, i, k} \\ s . t . C 3, C 4, C 6, C 7, C 11 - C 16 . \end{matrix}

(25)

Since P3 is a mixed-integer non-convex problem, the SMP method is used to obtain the optimal solution in an iterative manner.

The SMP was first proposed in [32], which contains two groups of elements, called the male set and female set. Everyone in the set has his/her own preference list for the opposite sex. According to the preference list, the current match was considered stable if there was not one male or female who rated the female or male in the other couples as better than the current partner in any two couples. The Gale–Shapley (GS) algorithm effectively solves the above problems. The main idea of the GS algorithm is to take the male’s point of view (either male or female) and let a male select a female according to his/her preference list. The female temporarily accepts the choice until a better man chooses her and then updates her choice. According to the above process, the two sets continuously cycle and update the selection until the match is stable.

Algorithm 1 Algorithm.

1:: Initialize the actor network parameters $θ_{u}, \forall u \in U$ and the critic network parameter $ϕ$ .
2:: Initialize the experience replay buffer $D$ with size D, mini-batch size $D^{'}$ , sample reuse time N, and state $S_{0}$ .
3:: for episode = 1 to $E p_{\max}$ do
4:: Initialize the number of total mission completion time slots $k = 1$ , the energy penalty $E_{pe}$ , and the position penalty $ρ_{k}$ .
5:: while done is not True do
6:: Every agent sample action $A_{u, k}$ with $π_{θ_{u}} (A_{k_{u}} | S_{k})$ .
7:: if the UAV u flies beyond the target area then
8:: $ρ_{k} = 1 / U$ , cancel the actions of $V_{u, k}$ and $ψ_{u, k}$ , and update $r_{k}$ , $S_{k + 1}$ based on the current state.
9:: end if
10:: if $E_{u}^{total} > {\bar{E}}_{u}, \forall u \in U$ then
11:: Let done be True, and $E_{pe} = - 50$
12:: end if
13:: Get the reward $r_{k}$ and next state $S_{k + 1}$ .
14:: Store experience data $(S_{k}, A_{k}, r_{k}, S_{k + 1})$ in $D$ .
15:: $k = k + 1$ .
16:: if UAVs complete the mission or $k = K^{\max}$ then
17:: Let done be True.
18:: end if
19:: if $D$ is full then
20:: Compute advantages ${\hat{A}}_{k}$ and the state value $V_{ϕ} (S_{t})$ .
21:: for epoch = 1 to $N \times (D / D^{'})$ do
22:: Sample $D^{'}$ from $D$ .
23:: Compute $L (θ_{u})$ of each actor network by (21).
24:: Update each actor network by a gradient method $θ_{u} \leftarrow θ_{u} - β_{a} \nabla_{θ_{u}} L (θ_{u})$ .
25:: Compute $L (ϕ)$ by (18).
26:: Update critic network by $ϕ \leftarrow ϕ - β_{a} \nabla_{ϕ} L (ϕ)$ .
27:: end for
28:: Clear the experience replay buffer $D$ .
29:: end if
30:: end while
31:: end for

In this paper, the buoy–UAV association relationship problem can be regarded as a preference ordering problem, which is a standard SMP. The UAVs and buoys can be regarded as males and females, respectively. In contrast to the traditional SMP problem, U and M may not be equal. At time slot k, the goal of minimizing the males’ cost can be regarded as maximizing the total amount of data collected by the UAVs. The amount of data collected is positively correlated with the SNR, so the preference value could be denoted by the SNR between the UAVs and buoys. According to the

P_{u, k}

obtained by the MAHPPO algorithm, the transmit power matrix of the buoys at time slot k can be expressed as

P = {[\begin{matrix} P_{1, k} & \dots & P_{1, k} & \dots & P_{1, k} \\ ⋮ & ⋱ & ⋰ & ⋮ \\ P_{u, k} & P_{u, k} & P_{u, k} \\ ⋮ & ⋰ & ⋱ & ⋮ \\ P_{U, k} & \dots & P_{U, k} & \dots & P_{U, k} \end{matrix}]}_{U \times M}

(26)

Then, the preference matrix can be expressed as

G = {[\begin{matrix} g_{1, 1, k} & \dots & g_{1, m, k} & \dots & g_{1, M, k} \\ ⋮ & ⋱ & ⋰ & ⋮ \\ g_{u, 1, k} & g_{u, m, k} & g_{u, M, k} \\ ⋮ & ⋰ & ⋱ & ⋮ \\ g_{U, 1, k} & \dots & g_{U, m, k} & \dots & g_{U, M, k} \end{matrix}]}_{U \times M}

(27)

It is unfair to match the association relationship directly according to the above preference matrix. For example, the distance between UAV u and buoy m is further than that between UAV

u^{'}

and buoy m, but the agent u chooses higher transmit power to make

g_{u, m, k} > g_{u^{'}, m, k}

, which causes buoy m to tend to be associated with UAV u and consume more energy of buoy m. Hence, the transmit power of all buoys was set to be the same.

Since the distances between the buoys and UAV u are different, there is no element value the same in any row. For buoy m, the distances between the UAVs and buoy m are also different due to the limit of

d^{min}

, so there is no element value the same in any column, except in the case of

P_{u, k}^{'} = 0

. Hence, there must be

u \neq u *

and

m \neq m *

if there is

g_{u, m, k} = g_{u *, m *, k}

in

G

. The maximum SNR is defined as

g^{max}

. Combined with the idea of the GS algorithm, we introduce Lemma 1.

Lemma 1.

The buoy–UAV with

g^{max}

in

G_{U, M, k}

is the best choice for each.

Proof of Lemma 1.

Suppose

U = 2

and

M = 3

and the preference of the first and second UAVs is

g_{1, 1, k} > g_{1, 2, k} > g_{1, 3, k}

and

g_{2, 1, k} > g_{2, 3, k} > g_{2, 2, k}

, respectively. The preferences of the three buoys are

g_{2, 1, k} > g_{1, 1, k}

,

g_{1, 2, k} > g_{2, 2, k}

, and

g_{2, 3, k} > g_{1, 3, k}

. Hence, we have

g^{max} = g_{2, 1, k}

. First, UAV 1 and UAV 2 both choose Buoy 1 according to their own preferences. However, Buoy 1 is selected to be associated with UAV 2 because of

g_{2, 1, k} > g_{1, 1, k}

. The second row and first column of

G_{U, M, k}

are deleted to obtain

G^{'}

. UAV 1 chooses the maximum value in

G^{'}

, which means it chooses to be associated with Buoy 2, and so on. □

The above result could be also obtained by changing the goal to the cost of the buoys. Hence, the SAA algorithm achieves stable matching while obtaining the global optimal solution and is shown in Algorithm 2.

3.3. Joint Optimization Algorithm

After obtaining the association relationship between the buoys and UAVs, we aimed to determine the UAV–OBS association relationship. The considered association relationship subproblem of the UAV–OBS with the given

V

,

ψ

and

P

is formulated as

\begin{matrix} (P 4) : max_{A} & \sum_{k = 1}^{K} \sum_{u = 1}^{U} R_{u, 0, k} \\ s . t . C 4, C 6, C 11, C 12, C 15, C 16 . \end{matrix}

(28)

Although agents choose to offload the data, they cannot determine whether the actions satisfy the relevant constraints of subproblem P4. Hence, a heuristic algorithm of the UAV–OBS association is proposed to guarantee that the UAV’s actions satisfy these constraints. The detailed procedure is shown in Algorithm 3. In particular, lines 3–10 determine whether the current action selections

χ_{k}

of the UAVs satisfy the relevant constraints and update the action selections

χ_{k}^{new}

. In line 5, the bandwidth occupied by each UAV is not available due to the uncertainty of the UAV–OBS association. Hence, we take the maximum value B satisfying constraints C15 and C16 when solving the problem P4.

Algorithm 2 SMP-based buoy-UAV association algorithm (SAA).

1:: $U^{co}$ , $χ_{k}^{new}$ , $V_{k}$ , $c_{k}^{buoy}$ , $κ_{k}$ , $P = \{P_{u, k}, \forall u \in U^{co}\}$ , and $h_{u, m, k}, \forall u \in U, m \in M$ .
2:: Set $i = 0$ .
3:: for $u = 1$ to $U^{co}$ do
4:: $P^{\max} = max (P)$ .
5:: Let the transmit power of all buoys be $P^{\max}$ , and delete $P^{\max}$ from $P$ .
6:: Obtain the preference matrix $G$ according to $h_{u, m, k}$ and $P^{\max}$ .
7:: Obtain the energy consumption $E_{u}^{pr}, \forall u \in U^{co}$ according to $V_{k}$ .
8:: Obtain the energy consumption of all buoys $E_{m}, \forall m \in M$ according to $P^{\max}$ .
9:: for $m = 1$ to M do
10:: if $c_{m, k}^{buoy} = 0$ , or $g_{u, m, k}$ , $E_{u}^{pr}$ , or $E_{m}$ does not satisfy C3 and C6 then
11:: Set $g_{u, m, k} = 0$ in $G$ , so as to obtain the new preference matrix $G^{'}$ .
12:: end if
13:: end for
14:: for $n = 1$ to $U^{co}$ do
15:: Obtain the maximum SNR $g^{max}$ in $G^{'}$ .
16:: if $g^{max} \neq 0$ then
17:: Obtain the UAV number u and buoy number m corresponding to $g^{max}$ , and $κ_{u, k} = m$ .
18:: Delete the UAV number u from $U^{co}$ , and set $χ_{u, k}^{new} = 0$ .
19:: Set the row u and column m of $G^{'}$ equal to 0, and then, update $G^{'}$ .
20:: else
21:: break.
22:: end if
23:: end for
24:: if $U^{co} = \emptyset$ then
25:: break.
26:: end if
27:: end for
28:: Let $κ_{k}^{'}$ indicate the set of UAVs that are first transformed.
29:: if UAV u is not associated with any buoy then
30:: Set $χ_{u, k}^{new} = 1$ , and add the number of UAVs u to $κ_{k}^{'}$ .
31:: end if
32:: if $χ_{u, k} = 1$ and $χ_{u, k}^{new} = 1$ then
33:: Set $χ_{u, k}^{new} = 2$ , and $κ_{k} = - 1$ . Then, add the number of UAVs u to $κ_{k}^{'}$ .
34:: end if
35:: return $χ_{u, k}^{new}$ , $κ_{k}$ , $κ_{k}^{'}$ .

In summary, the problem P1 could be solved effectively and efficiently with the proposed SAA, UAA, and MAHPPO algorithms. The detailed process is shown in Algorithm 4. Specifically, we first obtain

χ_{k}

according to the actions of the agents at time slot k in Algorithm 1. According to

χ_{k}

, we divide the number of UAVs into two initial groups: collection

U^{co}

and offloading

U^{of}

. In line 8, we perform Algorithm 3 to judge

U^{of}

. In lines 11–24 of Algorithm 3, the UAV whose action does not satisfy the constraints is transformed to

χ_{u, k}^{new} = 0

, and its number is formed into a new set

κ_{k}^{'}

. In lines 9–10, we update

U^{co}

by

U^{co} \cap κ_{k}^{'}

. Then, we perform Algorithm 2 to judge

U^{co}

. Some UAVs in

κ_{k}^{'}

may not satisfy the constraints of both data collection and data offloading after two judgments. Hence, in lines 32–34 of Algorithm 2, we set

{χ^{*}}_{u, k} = 2

if

χ_{u, k} = 0

and

χ_{k}^{new} = 0

. In lines 12–13, for UAVs that have only experienced the judgment of Algorithm 2 once, but did not satisfy the constraints, we perform Algorithm 3 to make the second judgment. In lines 22–24 of Algorithm 3, we set

{χ^{*}}_{u, k} = 2

if

χ_{u, k} = 1

and

χ_{k}^{new} = 1

. Hence, there may be some UAVs that neither collect, nor offload data in a certain time slot. After the above three judgments, the association relationship of all UAVs can be determined. The above procedure is looped until done is True.

Algorithm 3 UAV–OBS association algorithm (UAA).

1:: Input ${\bar{g}}_{of}$ , $V_{u, k}, \forall u \in U$ , $U^{of}$ , $χ_{k}^{new}$ , $κ_{k}$ , $P_{u, 0, k}, \forall u \in U$ , and current channel gain $h_{u, 0, k}, \forall u \in U$ .
2:: repeat
3:: Obtain the SNR of UAV–OBS $g_{u, 0, k}$ according to $P_{u, 0, k}$ .
4:: Obtain the maximum transmission rate of UAV ${R_{max}}_{u, 0, k}$ according to B.
5:: Obtain the energy consumption of UAVs $E_{u}^{total}$ according to $V_{u}, k$ ,.
6:: if $g_{u, 0, k}$ , $E_{u}^{total}$ , and ${R_{max}}_{u, 0, k}$ satisfy C4, C6, and C15, respectively then
7:: Set $χ_{k}^{new} = 1$ , and $κ_{u, k} = 0$ .
8:: else
9:: Set $χ_{k}^{new} = 0$
10:: end if
11:: if ${R_{max}}_{u, 0, k}$ does not satisfy C15, and the data of all buoys are fully collected then
12:: if $g_{u, 0, k}$ and $E_{u}^{total}$ satisfy C4 and C6 then
13:: Set $χ_{k}^{new} = 1$ and $κ_{u, k} = 0$ .
14:: else
15:: Set $χ_{k}^{new} = 0$ .
16:: end if
17:: end if
18:: $u = u + 1$
19:: until $u = U^{of}$
20:: Let $κ_{k}^{'}$ indicate the set of UAVs that are first transformed.
21:: if $χ_{u, k} = 0$ and $χ_{k}^{new} = 0$ then
22:: Set ${χ^{*}}_{u, k} = 2$ and $κ_{k} = - 1$ .
23:: Add the number of UAVs u to $κ_{k}^{'}$ .
24:: end if
25:: return $χ_{u, k}^{new}$ , $κ_{k}$ , $κ_{k}^{'}$ .

Algorithm 4 Joint optimization algorithm based on SAA, UAA, and MAHPPO (SU-MAHPPO).

Input:: The initial positions of UAVs $q_{u, 1}, \forall u \in U$ , the positions of buoys $x_{m}, y_{m}, \forall m \in M$ , and the position of OSB $x_{0}, y_{0}$ .
Output:: $A, P, V, ψ$ .
1:: /* In Algorithm 1 */
2:: for episode = 1 to $E p_{\max}$ do
3:: while done is not True do
4:: Perform $L i n e s$ 6–9 of Algorithm 1.
5:: Obtain $χ_{k}$ according to $A_{u, k}$ .
6:: Obtain initial $U^{co}$ and $U^{of}$ according to $χ_{k}$ .
7:: Let $χ_{k}^{new} = χ_{k}$ and $κ_{u, k} = - 1, \forall u \in U$ .
8:: Update $χ_{k}^{new}$ , $κ_{k}$ , and $κ_{k}^{'}$ by performing Algorithm 3.
9:: if $κ_{k}^{'}$ is not null or $χ_{u, k}^{new} = 0$ exists in $χ_{k}^{new}$ then
10:: $U^{co} \leftarrow U^{co} \cap κ_{k}^{'}$ .
11:: Update $χ_{u, k}^{new}$ , $κ_{k}$ , and $κ_{k}^{'}$ by performing Algorithm 2.
12:: if $κ_{k}^{'}$ is not null then
13:: $U^{of} \leftarrow κ_{k}^{'}$ .
14:: Update $χ_{u, k}^{new}$ , $κ_{k}$ , and $κ_{k}^{'}$ by performing Algorithm 3.
15:: end if
16:: end if
17:: Perform $L i n e s$ 10–26 of Algorithm 1.
18:: end while
19:: end for

3.4. Complexity Analysis

The complexity of a neural network usually depends on the state dimension of the input layer, the number of neural network layers, and the number of neurons in the hidden layers and output layer. Since the MAHPPO algorithm contains U actor networks and a critic network, the complexity of the MAHPPO algorithm can be expressed as

O (E_{p_{max}} K^{max} (U \sum_{l = 1}^{L^{a} - 1} e_{l}^{a} e_{l + 1}^{a} + \sum_{l = 1}^{L^{c} - 1} e_{l}^{c} e_{l + 1}^{c}))

, where

L^{a}

and

L^{c}

represent the number of neural network layers of the actor and critic networks, respectively, and

e_{l}^{c}

and

e_{l}^{a}

represent the number of neurons in the l-th layer of the actor and critic networks, respectively. Since the maximum value of

U_{co}

and

U_{of}

is U, the highest algorithm complexity of Algorithm 2 and Algorithm 3 is

O (U (U + M))

and

O (U)

. Obviously, the complexity of Algorithm 2 is higher than that of Algorithm 3. In order to reduce the complexity of Algorithm 4, Algorithm 3 is executed with priority so that Algorithm 2 only needs to be executed once. In summary, the computational complexity of Algorithm 4 is

O (E_{p_{max}} K^{max} (U \sum_{l = 1}^{L^{a} - 1} e_{l}^{a} e_{l + 1}^{a} + \sum_{l = 1}^{L^{c} - 1} e_{l}^{c} e_{l + 1}^{c} + U (U + M) + U))

.

4. Simulation Results and Discussion

In the simulations, we set a 5000 m × 5000 m target area and a 1500 m × 1500 m no-fly zone.

M = 10

buoys with a data volume of 10 Mbits and a maximum transmit power

P_{\max} = 24

dBm were randomly distributed in the target area. The horizontal coordinates of the OBS were

(0, 0)

m. The initial horizontal coordinates of the UAVs with the flight height

H = 100

m were

(0, 0)

m,

(0, 2500)

m, and

(2500, 0)

m, respectively. The maximum flight velocity and angle of the UAVs were

V_{\max} = 50

m / s

and

ψ_{u, k} \in [0, 2 π]

, respectively. The transmit power of each UAV was

P_{u, 0, k} = 0.1

W. The minimum safety distance between the UAVs was

d_{\min} = 50

m. The time slot duration was

δ = 1

s. For the MAHPPO algorithm, we set the length of the episode

E p_{\max} = 18, 000

,

K^{\max} = 250

. The actor network and the critic network both had three hidden layers with

(256, 128, 64)

neurons. The learning rates of the actor network and the critic network were

b e t a_{a} = 0.0001

and

b e t a_{b} = 0.0003

, respectively. The discount factor was

γ = 0.99

. The experience replay buffer size D, mini-batch size

D^{'}

, and sample reuse time N were set to 1024, 256, and 8, respectively. The hyperparameters were

Δ = 0.2

and

ι = 0.01

. Moreover, our simulation was based on Python 3.8 with the Pytorch package. Other simulation parameters are shown in Table 2.

Furthermore, we compared the proposed SU-MAHPPO algorithm with three baseline algorithms, which are listed as follows:

SU-MAPDQN algorithm: This algorithm refers to using the multi-agent PDQN (MAPDQN) algorithm to replace the MAHPPO algorithm in the algorithm proposed in our paper.
SU-MAPPO algorithm: This algorithm refers to using the multi-agent PPO (MAPPO) algorithm to replace the MAHPPO algorithm in the algorithm proposed in this paper. Furthermore, the continuous action space of the UAVs is discretized in this algorithm. The flight angle of UAV u at time slot k is expressed in four directions: up, down, left, and right, i.e., $ψ_{u, k} \in \{0, π / 2, π, 3 π / 2\}$ . The flight velocity of the UAV at time slot k is expressed as $V_{u, k} \in \{0, 10, 20, 30, 40, 50\}$ m/s.
MAHPPO algorithm: In this algorithm, the association relationships of the buoy–UAV and UAV–OBS are optimized. Hence, the action space of each agent of this algorithm is denoted as $A_{u, k} = \{I, ψ_{u, k}, V_{u, k}, P_{u, k}\}$ . The state space is denoted as $S_{k} = \{κ_{k}, κ_{k}^{a}, c_{k}^{buoy}, E_{k}^{buoy}, x_{k}, y_{k}, c_{k}^{uav}, E_{k}^{uav}, B_{k}\}$ , where $κ_{k}^{a}$ represents the discrete actions of all agents. The MAHPPO algorithm is shown in Algorithm 5.

Algorithm 5 MAHPPPO algorithm.

1:: /* In Algorithm 1 */
2:: Perform $L i n e s$ 1–6 of Algorithm 1.
3:: Obtain $κ_{k}^{a}$ according to the discrete actions of $A_{k}$ .
4:: Let $κ_{k} = κ_{k}^{a}$
5:: for $u = 1$ toUdo
6:: for $u * = 1$ to U do
7:: if $κ_{u, k}^{a} = κ_{u *, k}^{a}$ and $κ_{u, k}^{a} \neq 0$ then
8:: if $g_{u, k} > g_{u *, k}$ then
9:: Set $κ_{u, k} = - 1$ .
10:: end if
11:: end if
12:: if does not satisfy C3, C4, C6, C7, and C15 then
13:: Set $κ_{u, k} = - 1$ .
14:: end if
15:: end for
16:: end for
17:: Perform $L i n e s$ 7–31 of Algorithm 1.

As shown in Figure 3, we first compared the accumulative reward of SU-MAHPPO with the other three algorithms. The curves were smoothed for the convenience of observation. We can see that the proposed SU-MAHPPO algorithm was significantly better than the other algorithms. The SU-MAHPPO scheme could be stable and convergent after 10,000 episodes. Although the SU-MAPPO algorithm can quickly converge, the accumulated reward was lower than the SU-MAHPPO algorithm, and the curve suddenly dropped after 15,000 episodes of training. Moreover, the SU-MAPDQN algorithm cannot be convergent after 18,000 episodes. This is because the continuous action values corresponding to each value of discrete actions are output in MAPDQN, which obviously increases the computational complexity. Furthermore, the performance of the MAHPPO algorithm was far lower than that of our proposed algorithm within 18,000 episodes. This is because the complex action space and state space make it more difficult for the agents to explore. This confirms the advantages of the combination of the SAA algorithm and UAA algorithm to narrow the exploration domain of the agents.

Figure 4 shows the effectiveness of the technologies mentioned in Section 3.1. It can be seen that the proposed SU-MAHPPO algorithm without reward scaling had difficulty converging and was more unstable within 18,000 episodes. The reason is that the reward scaling is beneficial to the fitting of the value function. The curve of the SU-MAHPPO algorithm without state and action normalization shows an upward trend, but it also cannot converge within 18,000 episodes. This is because the order of magnitude difference between the parameters in the state

S_{k}

and action space

A_{k}

is large, which is not conducive to the training of neural networks. Reward scaling is beneficial to the fitting of the value function. Hence, the above two techniques can effectively enhance the convergence and stability of the proposed algorithm.

Figure 5 shows the UAV trajectories’ comparison between the SU-MAHPPO algorithm and SU-MAPPO algorithm. The mission completion time of SU-MAHPPO and SU-MAPPO was 53 s and 67 s, respectively. SU-MAHPPO is better, since the designed reward function is positively related to the data transmission rate in each time slot. Moreover, the continuous action space of the SU-MAHPPO algorithm allows the agents to more accurately control the positions of the UAVs in each time slot. Besides, the continuous action space makes it more accurately adjust the transmit power of the buoys to maximize the accumulative reward. The discretized action space makes it difficult for the UAVs to reach the optimal trajectory for collecting and offloading data. Furthermore, it can be seen that the UAV trajectories of the proposed algorithm were not close to the buoy or OBS. This is because the UAVs can change the mission mode in different time slots. Take the trajectory of the UAV close to the buoy for example. In this case, the time of data collection will be reduced, but the UAV will be far away from the OBS, which makes it difficult to meet the SNR threshold of data offloading. Hence, the UAV needs more time to offload the data, so that the total mission completion time is longer.

Figure 6 shows the trajectories of the three groups of UAVs taking off from different initial positions. The initial positions of the first group of UAVs are the same as the default positions, namely Position 1. The initial positions of the second group of UAVs are (1000, 1000), (1500, 2000), and (3000, 2000), namely Position 2. The initial positions of the third group of UAVs are (1500, 1500), (1500, 3000), and (3000, 1500), namely Position 3. It can be seen that the UAVs can avoid the no-fly zone even if they take off at the edge of the no-fly zone. The performance advantage benefits what we set as the position penalty in the reward function, which can guide the agent to control the UAVs flying in the target area.

Figure 7 shows the comparison of the multi-UAV trajectories based on the SU-MAPPO algorithm with different SNR thresholds of data collection. It can be seen that the UAV trajectories were closer to the buoys with the growth of the SNR threshold. This is because the designed reward function is mainly related to the transmission rate and the time reward

K^{*}

. With the SNR thresholds increasing, the UAV needs to be close enough to the buoy to achieve the minimum transmission rate requirement. The agent chooses to let the UAV be closer to the buoy to enhance the data collection transmission rate, which contributes to shortening the data collection time and obtaining a larger value of

K^{*}

.

Figure 8 shows the total mission completion time of the SU-MAHPPO algorithm and the SU-MAPPO algorithm with different SNR thresholds of data collection. Figure 9 and Figure 10 show the total mission completion time of the SU-MAHPPO algorithm and the SU-MAPPO algorithm with different channel conditions. It can be seen that the performance of our proposed algorithm is significantly better than the SU-MAPPO algorithm. The reason is that the discretized action space makes the SU-MAPPO algorithm unable to explore and select the optimal action in the considered environment, which further verifies the advantage of the proposed algorithm in the continuous action space.

5. Conclusions

This paper studied the multi-UAV-assisted data collection and offloading problem in the maritime IoT. The proposed MAHPPO-based joint optimization algorithm could significantly reduce the total mission completion time. First, a MAHPPO-based algorithm was proposed to obtain the UAV trajectories, transmit power of the buoy, and mission mode of the UAVs. Second, an SMP-based algorithm was designed to solve the association relationship subproblem. Finally, a joint optimization algorithm was proposed in which the SAA algorithm and the UAA algorithm were executed alternately to finally determine the association relationship of the buoy–UAV and UAV–OBS. The simulation results showed that the proposed algorithm can effectively reduce the total mission completion time of the UAVs, and its performance was significantly better than the comparison schemes.

In the future work, we will investigate cooperative multi-UAV communication to further improve the data collection and offloading performance. Moreover, the location uncertainties of the buoys will be considered to enhance the channel estimation accuracy to improve the buoy–UAV association efficiency.

Author Contributions

Conceptualization, Z.L., Y.D. and L.L.; methodology, Z.L., Y.D. and L.L.; software, Z.L. and Y.D.; validation, Z.L.; formal analysis, Z.L. and Y.D.; investigation, Z.L., L.L. and Y.D.; data curation, Z.L. and Y.D.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L., Y.D., L.L. and B.L.; supervision, Y.D. and L.L.; project administration, Y.D. and L.L.; funding acquisition, Y.D., L.L. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Key Research and Development Program of China under Grant 2019YFE0111600, in part by the National Natural Science Foundation of China under Grant 62101089, Grant 62002042, Grant 51939001, and Grant 61971083, in part by the China Postdoctoral Science Foundation under Grant 2021M700655 and Grant 2021M690022, in part by the Liaoning Revitalization Talents Program under Grant XLYC2002078, and in part by the Major Key Project of PCL under Grant PCL2021A03-1.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to the National Key Research and Development Program of China, the National Natural Science Foundation of China, the Liaoning Revitalization Talents Program, and the Major Key Project of PCL for the financial support.

Conflicts of Interest

The authors declare no conflict of interest.

References

Liu, Y.; Yan, J.; Zhao, X. Deep Reinforcement Learning Based Latency Minimization for Mobile Edge Computing With Virtualization in Maritime UAV Communication Network. IEEE Trans. Veh. Technol. 2022, 71, 4225–4236. [Google Scholar] [CrossRef]
Fang, X.; Feng, W.; Wang, Y.; Chen, Y.; Ge, N.; Ding, Z.; Zhu, H. NOMA-Based Hybrid Satellite-UAV-Terrestrial Networks for 6G Maritime Coverage. IEEE Trans. Wirel. Commun. 2022, accepted. [Google Scholar] [CrossRef]
Xia, T.; Wang, M.M.; Zhang, J.; Wang, L. Maritime Internet of Things: Challenges and Solutions. IEEE Wirel. Commun. 2020, 27, 188–196. [Google Scholar] [CrossRef]
Xia, T.; Wang, M.M.; You, X. Satellite Machine-Type Communication for Maritime Internet of Things: An Interference Perspective. IEEE Access 2019, 7, 76404–76415. [Google Scholar] [CrossRef]
Huo, Y.; Dong, X.; Beatty, S. Cellular Communications in Ocean Waves for Maritime Internet of Things. IEEE Internet Things J. 2020, 7, 9965–9979. [Google Scholar] [CrossRef] [Green Version]
Gang, S.; Jun, W.; Junwei, P. Research on 5G Wireless Signal Propagation and Coverage Capability. In Proceedings of the 2021 IEEE 4th International Conference on Electronic Information and Communication Technology (ICEICT), Xi’an, China, 18–20 August 2021; pp. 529–533. [Google Scholar]
Jiang, S. A possible development of marine Internet: A large scale cooperative heterogeneous wireless network. In Internet of Things, Smart Spaces, and Next Generation Networks and Systems; Balandin, S., Andreev, S., Koucheryavy, Y., Eds.; Springer: Cham, Switzerland, 2015; pp. 481–495. [Google Scholar]
Liu, Z.; Cao, Y.; Gao, P.; Hua, X.; Zhang, D.; Jiang, T. Multi-UAV network assisted intelligent edge computing: Challenges and opportunities. China Commun. 2022, 19, 258–278. [Google Scholar] [CrossRef]
Wang, J.; Na, Z.; Liu, X. Collaborative Design of Multi-UAV Trajectory and Resource Scheduling for 6G-Enabled Internet of Things. IEEE Internet Things J. 2021, 8, 15096–15106. [Google Scholar] [CrossRef]
Shen, C.; Chang, T.H.; Gong, J.; Zeng, Y.; Zhang, R. Multi-UAV Interference Coordination via Joint Trajectory and Power Control. IEEE Trans. Signal Process. 2020, 68, 843–858. [Google Scholar] [CrossRef] [Green Version]
Bejaoui, A.; Park, K.H.; Alouini, M.S. A QoS-Oriented Trajectory Optimization in Swarming Unmanned-Aerial-Vehicles Communications. IEEE Wirel. Commun. Lett. 2020, 9, 791–794. [Google Scholar] [CrossRef] [Green Version]
Diao, X.; Yang, W.; Yang, L.; Cai, Y. UAV-Relaying-Assisted Multi-Access Edge Computing With Multi-Antenna Base Station: Offloading and Scheduling Optimization. IEEE Trans. Veh. Technol. 2021, 70, 9495–9509. [Google Scholar] [CrossRef]
Kuo, Y.C.; Chiu, J.H.; Sheu, J.P.; Hong, Y.W.P. UAV Deployment and IoT Device Association for Energy-Efficient Data-Gathering in Fixed-Wing Multi-UAV Networks. IEEE Trans. Green Commun. Netw. 2021, 5, 1934–1946. [Google Scholar] [CrossRef]
Gao, Y.; Wu, Y.; Cui, Z.; Yang, W.; Hu, G.; Xu, S. Robust trajectory and communication design for angle-constrained multi-UAV communications in the presence of jammers. China Commun. 2022, 19, 131–147. [Google Scholar] [CrossRef]
Hua, M.; Yang, L.; Wu, Q.; Swindlehurst, A.L. 3D UAV Trajectory and Communication Design for Simultaneous Uplink and Downlink Transmission. IEEE Trans. Commun. 2020, 68, 5908–5923. [Google Scholar] [CrossRef]
Liu, D.; Xu, Y.; Wang, J.; Chen, J.; Wu, Q.; Anpalagan, A.; Xu, K.; Zhang, Y. Opportunistic Utilization of Dynamic Multi-UAV in Device-to-Device Communication Networks. IEEE Trans. Cogn. Commun. Netw. 2020, 6, 1069–1083. [Google Scholar] [CrossRef]
Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.C.; Kim, D.I. Applications of Deep Reinforcement Learning in Communications and Networking: A Survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar]
Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
Li, M.; Chen, C.; Wu, H.; Guan, X.; Shen, X. Edge-Assisted Spectrum Sharing for Freshness-Aware Industrial Wireless Networks: A Learning-Based Approach. IEEE Trans. Wirel. Commun. 2022, 21, 7737–7752. [Google Scholar] [CrossRef]
Wu, F.; Zhang, H.; Wu, J.; Han, Z.; Poor, H.V.; Song, L. UAV-to-Device Underlay Communications: Age of Information Minimization by Multi-Agent Deep Reinforcement Learning. IEEE Trans. Commun. 2021, 69, 4461–4475. [Google Scholar] [CrossRef]
Wang, L.; Wang, K.; Pan, C.; Xu, W.; Aslam, N.; Hanzo, L. Multi-Agent Deep Reinforcement Learning-Based Trajectory Planning for Multi-UAV Assisted Mobile Edge Computing. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 73–84. [Google Scholar] [CrossRef]
Gao, A.; Wang, Q.; Liang, W.; Ding, Z. Game Combined Multi-Agent Reinforcement Learning Approach for UAV Assisted Offloading. IEEE Trans. Veh. Technol. 2021, 70, 12888–12901. [Google Scholar] [CrossRef]
Hausknecht, M.; Stone, P. Deep reinforcement learning in parameterized action space. arXiv 2015, arXiv:1511.04143. [Google Scholar]
Xiong, J.; Wang, Q.; Yang, Z.; Sun, P.; Han, L.; Zheng, Y.; Fu, H.; Zhang, T.; Liu, J.; Liu, H. Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv 2018, arXiv:1810.06394. [Google Scholar]
Yin, S.; Yu, F. Resource Allocation and Trajectory Design in UAV-Aided Cellular Networks Based on Multiagent Reinforcement Learning. IEEE Internet Things J. 2022, 9, 2933–2943. [Google Scholar] [CrossRef]
Fan, Z.; Su, R.; Zhang, W.; Yu, Y. Hybrid actor-critic reinforcement learning in parameterized action space. arXiv 2019, arXiv:1903.01344. [Google Scholar]
Zhang, J.; Liang, F.; Li, B.; Yang, Z.; Wu, Y.; Zhu, H. Placement optimization of caching UAV-assisted mobile relay maritime communication. China Commun. 2020, 17, 209–219. [Google Scholar] [CrossRef]
Zeng, Y.; Xu, J.; Zhang, R. Energy Minimization for Wireless Communication With Rotary-Wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345. [Google Scholar] [CrossRef] [Green Version]
Hao, Z.; Xu, G.; Luo, Y.; Hu, H.; An, J.; Mao, S. Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning. arXiv 2022, arXiv:2205.11854. [Google Scholar] [CrossRef]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; Madry, A. Implementation matters in deep policy gradients: A case study on PPO and TRPO. arXiv 2020, arXiv:2005.12729. [Google Scholar]
Gale, D.; Shapley, L.S. College admissions and the stability of marriage. Am. Math. Mon. 1962, 69, 9–15. [Google Scholar] [CrossRef]

Figure 1. The UAV-assisted maritime IoT system.

Figure 2. The MAHPPO algorithm structure of agent u.

Figure 3. Convergence with different algorithms.

Figure 4. Convergence without state and action normalization and reward scaling.

Figure 5. UAV trajectories with different schemes.

Figure 6. UAV trajectories with different initial positions of taking off.

Figure 7. UAV trajectories with different SNR thresholds.

Figure 8. Total mission completion time with different SNR thresholds.

Figure 9. Total mission completion time with different spectrums.

Figure 10. Total mission completion time with different LoS link factors.

Table 1. Comparison between our work and the related references.

Reference	UAV Trajectory	UAV Hovering Location	Uplink	Downlink	Types of Action Variables	Deep Reinforcement Learning
[12]		✓	✓	✓
[13]	✓		✓
[14]	✓		✓
[15]	✓		✓	✓
[16]			✓	✓
[20]	✓			✓	continuous	MADDPG
[21]	✓			✓	continuous	MADDPG
[22]	✓		✓	✓	continuous	MADDPG
[25]	✓			✓	discrete and continuous	MAPDQN
our work	✓		✓	✓	discrete and continuous	MAHPPO

Table 2. Simulation parameters.

Parameter	Value
environment parameter, a, b	9.61, 0.16
excessive path loss, $ξ_{LoS}$ , $ξ_{NLoS}$	1, 20
channel parameter, $μ_{u, i, k}$	1
wavelength, $λ$	0.15 m
noise power, $σ^{2}$	−104 dBm
SNR threshold, ${\bar{g}}_{co}$ , ${\bar{g}}_{of}$	8 dB, 2 dB
energy threshold, ${\bar{E}}_{u}$ , ${\bar{E}}_{m}$	150 kJ, 1.25 J
tip speed of the rotor blade, $U_{tip}$	120
fuselage drag ratio, $d_{0}$	0.6
air density, $ρ$	1.225
rotor solidity, l	0.05
blade angular velocity, f	0.503
blade profile power and induced power, $P_{o}$ , $P_{i}$	79.86 W, 0.99 W
LoS and NLoS link factor, $υ^{LoS}$ , $υ^{NLoS}$	2, 2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liang, Z.; Dai, Y.; Lyu, L.; Lin, B. Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach. Remote Sens. 2023, 15, 292. https://doi.org/10.3390/rs15020292

AMA Style

Liang Z, Dai Y, Lyu L, Lin B. Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach. Remote Sensing. 2023; 15(2):292. https://doi.org/10.3390/rs15020292

Chicago/Turabian Style

Liang, Ziyi, Yanpeng Dai, Ling Lyu, and Bin Lin. 2023. "Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach" Remote Sensing 15, no. 2: 292. https://doi.org/10.3390/rs15020292

APA Style

Liang, Z., Dai, Y., Lyu, L., & Lin, B. (2023). Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach. Remote Sensing, 15(2), 292. https://doi.org/10.3390/rs15020292

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Adaptive Data Collection and Offloading in Multi-UAV-Assisted Maritime IoT Systems: A Deep Reinforcement Learning Approach

Abstract

1. Introduction

2. System Model and Problem Formulation

2.1. Channel Model

2.2. Transmission Model

2.3. Energy Consumption Model

2.4. Problem Formulation

3. Joint Optimization Algorithm Design

3.1. MAHPPO Algorithm

3.1.1. Action Space

3.1.2. State Space

3.1.3. Reward Function

3.2. SMP-Based Buoy–UAV Association Algorithm

3.3. Joint Optimization Algorithm

3.4. Complexity Analysis

4. Simulation Results and Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI