Article

Resource Allocation in UAV-D2D Networks: A Scalable Heterogeneous Multi-Agent Deep Reinforcement Learning Approach

1 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China
2 Xi’an Intellectual Property Protection Center, Xi’an 710007, China
3 School of Information Engineering, Chang’an University, Xi’an 710018, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(22), 4401; https://doi.org/10.3390/electronics13224401
Submission received: 20 September 2024 / Revised: 28 October 2024 / Accepted: 7 November 2024 / Published: 10 November 2024

Abstract: In unmanned aerial vehicle (UAV)-assisted device-to-device (D2D) caching networks, the uncertainty from unpredictable content demands and variable user positions poses a significant challenge for traditional optimization methods, often making them impractical. Multi-agent deep reinforcement learning (MADRL) offers significant advantages in optimizing multi-agent system decisions and serves as an effective and practical alternative. However, its application in large-scale dynamic environments is severely limited by the curse of dimensionality and communication overhead. To resolve this problem, we develop a scalable heterogeneous multi-agent mean-field actor-critic (SH-MAMFAC) framework. The framework treats ground users (GUs) and UAVs as distinct agents and designs cooperative rewards to convert the resource allocation problem into a fully cooperative game, enhancing global network performance. We also implement a mixed-action mapping strategy to handle discrete and continuous action spaces. A mean-field MADRL framework is introduced to minimize individual agent training loads while enhancing total cache hit probability (CHP). The simulation results show that our algorithm improves CHP and reduces transmission delay. A comparative analysis with existing mainstream deep reinforcement learning (DRL) algorithms shows that SH-MAMFAC significantly reduces training time and maintains high CHP as GU count grows. Additionally, by comparing with SH-MAMFAC variants that do not include trajectory optimization or power control, the proposed joint design scheme significantly reduces transmission delay.

1. Introduction

Recently, unmanned aerial vehicles (UAVs) have gained popularity across various industries because of their compact size, affordability, and high flexibility [1]. These attributes enable UAVs to effectively address challenges in traditional communications, such as high implementation costs and limited flexibility in specialized scenarios [2]. In high-density hotspot areas, adding fixed ground nodes is not cost-effective. In such cases, UAV-assisted communication demonstrates high flexibility and ease of deployment [3]. Based on simulations and measurements, LTE networks can support UAV communication. Additionally, 5G massive multiple-input multiple-output (MIMO) antennas enable cellular UAVs to coexist with ground users (GUs), offering significant potential [4]. Edge caching is proposed as a key technology to alleviate traffic load in cellular networks [5]. By storing popular content closer to GUs, such as at base stations (BSs) and UAVs, it reduces content transmission delay and alleviates backhaul traffic load. Additionally, leveraging user-side cache capacity can further alleviate backhaul bottlenecks, for instance, through the implementation of device-to-device (D2D) communication [6]. In UAV-assisted D2D caching networks, collaboration between UAVs and GUs optimizes the use of limited cache space and improves system performance [7].
Due to the highly dynamic nature of UAV-assisted communication systems, achieving real-time accuracy in system modeling is challenging. Traditional optimization methods often convert the original non-convex problem into a series of convex subproblems, which are then solved iteratively [8]. While this approach makes the original problem tractable, it often sacrifices accuracy. With an increase in the number of UAVs and GUs, computational complexity grows exponentially. Additionally, solving these problems optimally demands global information, which is not always possible in dynamic environments [8]. Recently, deep reinforcement learning (DRL) has garnered significant attention as a data-driven method. Due to its model-free nature and lack of need for real-time or complete prior information, DRL is considered an effective approach for managing task scheduling and wireless resource allocation in rapidly shifting UAV-assisted communication networks [9,10]. Compared to single-agent DRL, multi-agent systems offer several advantages: agents can learn strategies in a distributed manner through local environments, share experiences in a customized way, and maintain robustness by allowing the remaining agents to take over tasks if some agents fail [11].
Recent advancements in UAV-assisted communication have spurred numerous studies focusing on optimizing UAV trajectories, user association, and resource allocation to enhance quality-of-service (QoS) for GUs. To facilitate a better understanding of the methodologies and findings in the existing literature, Table 1 presents a comparative analysis of the related works. To fully leverage the high mobility of UAVs, some studies focus on trajectory design to enhance communication performance. For instance, Ref. [12] proposed a DRL solution for path planning in cellular-connected UAVs, leveraging a quantum-inspired experience replay framework to optimize flight direction while minimizing time costs and expected outage durations. Ref. [13] proposed a multi-agent deep reinforcement learning (MADRL) framework for optimizing the 3D trajectory design of cache-enabled UAVs in a dynamic wireless caching network, significantly enhancing network throughput by enabling UAVs to cooperatively learn from their experiences. On the other hand, several studies focused on the joint optimization of trajectory and resource allocation. Ref. [10] formulated a joint optimization problem involving trajectory design, user association, and power allocation, utilizing a strategic resource allocation algorithm that combined reinforcement learning and deep learning to maximize system utility for all served GUs. In [14], the authors jointly optimized user association, power allocation of non-orthogonal multiple access (NOMA), UAV deployment and caching placement, utilizing branch and bound (BaB) and DRL to minimize transmission delay in UAV-assisted caching networks for mixed augmented reality (AR) and multimedia applications, demonstrating lower long-term transmission delay in dynamic environments. Ref. [15] proposed a joint trajectory planning and cache management framework for UAV-assisted content delivery in intelligent transportation systems, addressing the challenges of dynamic vehicular networks through a proximal policy optimization (PPO) approach. Ref. [16] proposed a multi-UAV cooperative caching network optimization method supporting NOMA, which decoupled the problem into caching decisions and trajectory design, significantly improving cache hit probability (CHP) using the matching-DRL algorithm. However, [10,13,14] discretized continuous variables into a set of levels, while [15] only involved a discrete action space, limiting their ability to adapt to mixed action spaces. Furthermore, in these works, GUs rely solely on UAVs for service.
Another line of work has explored collaboration between UAVs and ground nodes. Ref. [8] proposed a joint optimization framework for UAV trajectory design and GU access control, utilizing the MADRL algorithm to enhance throughput and fairness in air–ground coordinated communication systems. Ref. [18] employed a multi-agent deep deterministic policy gradient (MADDPG) algorithm for resource management in UAV-assisted vehicular networks. In this framework, the macro base station (MBS) and UAV collaboratively make association decisions and allocate appropriate amounts of resources to vehicles to meet heterogeneous QoS requirements while maximizing task offloading efficiency. In [17], the authors proposed a framework that treats the MBS and UAV as cooperative agents for content transmission. In dense multiple access cellular networks, this framework utilizes the Dual-Clip PPO algorithm to minimize transmission delay by optimizing multi-user association, cache placement, UAV trajectory, and transmission power. Ref. [19] proposed a cooperative multi-agent actor–critic (MAAC) algorithm that enables UAVs and BSs to collaborate on caching decisions, considering user mobility to maximize CHP. However, as the number of agents in the system increases, current MADRL methods face the curse of dimensionality. The cache and communication resource allocation problem is characterized by high-dimensional parameters, resulting in excessive data processing requirements and complicating the task of finding efficient configurations [20]. The algorithms proposed in [8,10,13,14,15,16,17,18,19] face scalability issues due to the exponential growth in the joint action space as the number of agents increases.
Existing studies have not simultaneously considered heterogeneous services, mixed action spaces, and system scalability, which provides a significant motivation for this research. The network dynamics and node heterogeneity complicate the analysis of communication performance for cache offloading requests. Thus, it is essential to develop a communication model for content distribution among different types of cache nodes and analyze the CHP of various communication modes. To maximize the CHP, we need to develop a multi-agent framework that includes both GUs and UAVs, allowing them to coexist and interact, which differs from [10,13,14,15,16] that considered only homogeneous agents. Typically, resource management in UAV-assisted communication systems poses a mixed-integer problem [11], requiring DRL algorithms to handle mixed action spaces. Furthermore, we need to address the scalability issue of MADRL as the number of agents increases in large-scale environments. Inspired by [21], we ensure effective learning by having each agent interact directly with a limited set of neighboring agents. The key innovations and contributions of this paper are summarized as follows:
(1) We propose a UAV-assisted heterogeneous D2D caching network, jointly optimizing caching strategies, flight trajectories, and transmission power to achieve high CHP and reduce transmission delay. Taking into account the channel model and GU mobility, we derive the successful transmission probability (STP) and CHP.
(2) We introduce a scalable heterogeneous multi-agent mean-field actor–critic (SH-MAMFAC) framework to maximize total CHP and reduce training time. This framework includes homogeneous caching users (CUs) and heterogeneous UAVs, effectively addressing the exponential increase in agent interactions in traditional MADRL and enhancing scalability.
(3) We assess the performance of different algorithms by analyzing CHP and transmission delay across various system parameters. Our comprehensive quantitative comparison includes SH-MAMFAC, MADDPG-based algorithms, and traditional centralized and independent DRL algorithms. In addition, the proposed joint design scheme significantly reduces transmission delay compared to variants without trajectory optimization or power control.
The paper is organized as follows: Section 2 introduces UAV-assisted D2D caching networks, deriving the STP and CHP. In Section 3, we define the CHP maximization problem and propose the SH-MAMFAC algorithm to address it. Section 4 covers the presentation and analysis of simulation results, and Section 5 offers conclusions. For convenience, the commonly used abbreviations in this paper are compiled in Table A1, while the primary symbols and variables are summarized in Table 2.

2. System Model

Figure 1 illustrates the network structure, comprising an MBS, multiple GUs, and a UAV. The GUs are categorized into caching users (CUs) and requesting users (RUs). The maximum communication distance between D2D devices is $R_d$, allowing CUs to transmit files to RUs only when their distance is less than $R_d$. There are $I$ CUs, denoted as $\mathcal{I} = \{1, \ldots, I\}$, each with a cache size of $S_i$ Mbits. The $J$ RUs are denoted as $\mathcal{J} = \{1, \ldots, J\}$. The file library $\mathcal{F}$ contains $F$ files, each of size $S_f$ Mbits.
Content popularity can be modeled as Zipf distributions, as demonstrated in various real datasets [22]. Although content popularity follows a Zipf distribution, requests exhibit significant spatio-temporal variation due to differing preferences among GUs [23]. The probability that RU j requests file f in time slot t is [23]
$p_{j,f}(t) = \dfrac{p_f \, g(\phi_{j,t}, \vartheta_f)}{\sum_{j \in \mathcal{J}} g(\phi_{j,t}, \vartheta_f)},$
where $p_f = (1/f^{\xi}) / \sum_{f' \in \mathcal{F}} (1/f'^{\xi})$ represents the popularity of file f, and $\xi$ is the Zipf slope parameter. $g(\phi_{j,t}, \vartheta_f)$ is a kernel function representing the correlation between RU j and file f, with features $\phi_{j,t}$ and $\vartheta_f$. The request variable $req_{j,f}(t)$ is defined as $req_{j,f}(t) = 1$ if $p_{j,f}(t) > \varepsilon$ and $req_{j,f}(t) = 0$ otherwise, where $\varepsilon$ is the file popularity threshold.
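As a concrete illustration of this request model, the following Python sketch computes the Zipf popularity $p_f$, the per-RU request probabilities $p_{j,f}(t)$, and the request indicators $req_{j,f}(t)$. The kernel matrix pref_kernel and the threshold value are placeholders for quantities determined by the preference model, not values specified here.

import numpy as np

def zipf_popularity(F, xi):
    """Zipf popularity p_f proportional to 1/f^xi, normalized over the library."""
    ranks = np.arange(1, F + 1)
    weights = 1.0 / ranks ** xi
    return weights / weights.sum()

def request_probabilities(pref_kernel, p_f, eps):
    """Per-RU request probabilities p_{j,f}(t) and indicators req_{j,f}(t).

    pref_kernel: (J, F) array of kernel values g(phi_{j,t}, vartheta_f) (assumed given).
    """
    p_jf = p_f[None, :] * pref_kernel / pref_kernel.sum(axis=0, keepdims=True)
    req = (p_jf > eps).astype(int)
    return p_jf, req

# Example with J = 300 RUs, F = 30 files, Zipf slope 0.8, and a hypothetical threshold
p_f = zipf_popularity(F=30, xi=0.8)
p_jf, req = request_probabilities(np.random.rand(300, 30), p_f, eps=0.01)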

2.1. UAV Mobility Model

Assuming the UAV operates at an altitude of $H$ with a maximum communication range of $R_u$, it employs a time-division multiple access (TDMA) protocol to transmit content to RUs within a fixed duration of $T$ seconds [24]. The UAV’s cache storage capacity is $S_u$ Mbits. For clarity, we use a Cartesian coordinate system with the MBS at $\mathbf{w}_m = [x_m, y_m, h_m]$. At time slot t, the position of CU i is represented by $\mathbf{w}_i(t) = [x_i(t), y_i(t)], i \in \mathcal{I}$, while the position of RU j is denoted as $\mathbf{w}_j(t) = [x_j(t), y_j(t)], j \in \mathcal{J}$. The duration $T$ is divided into $N$ time slots of length $\delta_t = T/N$, allowing the position of the UAV to remain approximately stable throughout each slot. The flight trajectory is represented as $\mathcal{Q} = \{\mathbf{q}(1), \ldots, \mathbf{q}(N)\}$, where $\mathbf{q}(t) = [x(t), y(t)]^T$ denotes the discrete-time position at time slot t.
At the beginning of time slot t, the UAV flies in a horizontal direction determined by the angle $\theta_u(t) \in [0, 2\pi]$ over a distance $d_u(t) \in [0, d_{\max}]$, where $d_{\max}$ represents the maximum distance the UAV can travel in one time slot. The UAV’s coordinates at time slot t are $x(t) = x(0) + \sum_{t'=0}^{t} d_u(t') \cos(\theta_u(t'))$ and $y(t) = y(0) + \sum_{t'=0}^{t} d_u(t') \sin(\theta_u(t'))$. The UAV must remain within the boundary of the target area defined by $[0, x_{\max}] \times [0, y_{\max}]$. Additionally, to provide reliable service to RUs, the UAV needs to return to its initial position by the end of each cycle $T$. This results in the following movement constraints for the UAV [17,24,25]:
$0 \le \theta_u(t) \le 2\pi,$
$0 \le x(t) \le x_{\max},$
$0 \le y(t) \le y_{\max},$
$0 \le d_u(t) \le d_{\max},$
$\mathbf{q}(1) = \mathbf{q}(N).$
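A minimal sketch of this mobility update is given below. It applies the angle/distance action and projects the result back into the service area, which is one simple way to respect the boundary constraints; the training algorithm in Section 3 instead penalizes and recalibrates violating positions.

import numpy as np

def step_uav(q, theta, d, x_max, y_max, d_max):
    """One-slot UAV position update; clipping is one way to keep q inside the area."""
    d = np.clip(d, 0.0, d_max)                       # respect the per-slot distance limit
    q_next = q + d * np.array([np.cos(theta), np.sin(theta)])
    q_next[0] = np.clip(q_next[0], 0.0, x_max)       # stay inside [0, x_max]
    q_next[1] = np.clip(q_next[1], 0.0, y_max)       # stay inside [0, y_max]
    return q_next

# Example: fly 10 m at a 45-degree heading inside a 200 m x 200 m area
q = step_uav(np.array([100.0, 100.0]), np.pi / 4, 10.0, 200.0, 200.0, 10.0)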

2.2. Communication Model

(1) D2D communication model: In contrast to UAV communication, ground channels between D2D users typically experience additional fading with random fluctuations [26]. Assuming the channel coefficients between D2D users remain constant within a time slot, they can vary from one slot to another. The average channel power gain from CU i to RU j during time slot t is $\beta_{i,j}(t) = \beta_0 d_{i,j}^{-\alpha_0}(t) = \beta_0 / \|\mathbf{w}_i(t) - \mathbf{w}_j(t)\|^{\alpha_0}$, where $d_{i,j}(t)$ denotes the distance from CU i to RU j, and $\beta_0$ represents the reference channel power gain at $d_0 = 1$ m. $\alpha_0$ is the path loss exponent for the D2D link, specifically set to 2 [27]. The channel gain from CU i to RU j during time slot t is $h_{i,j}(t) = \beta_{i,j}(t) g_{i,j}(t)$, where $g_{i,j}(t)$ represents the small-scale channel fading between CU i and RU j, distributed as $\mathcal{CN}(0,1)$ [26]. $|h_{i,j}(t)|^2$ follows an exponential distribution with a mean of $\gamma_{i,j}$.
During time slot t, the data rate from CU i to RU j is
$r_{i,j}(t) = B_d \log_2\left(1 + \dfrac{P_i(t) |h_{i,j}(t)|^2}{I_{i,j}(t) + N_0}\right),$
where $P_i(t)$ denotes the transmission power of CU i at time slot t, constrained by a peak value $P_{\max}$ and an average value $\bar{P}$. $B_d$ is the bandwidth of the CUs. $I_{i,j}(t) = \sum_{i'=1, i' \ne i}^{I} P_{i'}(t) |h_{i',j}(t)|^2$ denotes the interference caused by all CUs except CU i and follows an exponential distribution with a mean of $\lambda_{i,j}$. $N_0$ represents the noise power.
The cumulative distribution function (CDF) of r i , j t is
$F_{r_{i,j}(t)}(r) = \Pr\left\{r_{i,j}(t) \le r\right\} = \Pr\left\{B_d \log_2\left(1 + \dfrac{P_i(t)|h_{i,j}(t)|^2}{I_{i,j}(t) + N_0}\right) \le r\right\} = \Pr\left\{\dfrac{P_i(t)|h_{i,j}(t)|^2}{I_{i,j}(t) + N_0} \le 2^{\frac{r}{B_d}} - 1\right\}.$
Due to GU mobility, node pairs needing data transmission may not always be in contact. We assume GUs enter and exit a finite area following a Poisson random process [28]. The contact duration $T_{i,j}$ between CU i and RU j is modeled as a continuous random variable with an exponential distribution and mean $\mu_{i,j}$ [29]. For CU i to establish a link with RU j, the probability of successfully transmitting a file of size $S_f$ Mbits within duration $T_{i,j}$ must exceed $v_{\min}$. This probability, referred to as the STP, is denoted $\Pr_{i,j}(t)$.
Lemma 1.
Assuming $e_1$ and $e_2$ follow independent exponential distributions, $x = \varsigma_1 e_1$, $y = \varsigma_2 e_2 + 1$, and $\varsigma_1, \varsigma_2 > 0$, the CDF of the random variable $z = x/y$ is expressed as
$F_z(z) = 1 - \dfrac{\varsigma_1}{\varsigma_1 + \varsigma_2 z} \exp\left(-\dfrac{z}{\varsigma_1}\right).$
Proof. 
The proof is presented in Appendix B.    □
Let $y = \dfrac{P_i(t) |h_{i,j}(t)|^2}{I_{i,j}(t) + N_0}$, $\varsigma_1 = \dfrac{P_i(t) \gamma_{i,j}}{N_0}$, and $\varsigma_2 = \dfrac{\lambda_{i,j}}{N_0}$. Based on Lemma 1, the CDF of $r_{i,j}(t)$ is
$F_{r_{i,j}(t)}(r) = F_y\left(2^{\frac{r}{B_d}} - 1\right) = 1 - \dfrac{\varsigma_1}{\varsigma_1 + \varsigma_2\left(2^{\frac{r}{B_d}} - 1\right)} \exp\left(-\dfrac{2^{\frac{r}{B_d}} - 1}{\varsigma_1}\right).$
Through the derivation presented in Appendix C, we can obtain the STP for the transmission of file f from CU i to RU j as follows:
$\Pr_{i,j}(t) = \Pr\left\{T_{i,j} \ge \dfrac{S_f}{r_{i,j}(t)}\right\} = \int_0^{+\infty} \dfrac{\varsigma_1}{\mu_{i,j}\left(\varsigma_1 + \varsigma_2\left(2^{\frac{S_f}{t B_d}} - 1\right)\right)} \exp\left(-\left(\dfrac{2^{\frac{S_f}{t B_d}} - 1}{\varsigma_1} + \dfrac{t}{\mu_{i,j}}\right)\right) dt.$
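In practice, this STP integral has no simple closed form and can be evaluated numerically. The sketch below uses SciPy quadrature with the symbols defined above; the guard on the exponent only avoids floating-point overflow near the lower limit, where the integrand vanishes anyway, and the example parameter values are illustrative rather than taken from this paper.

import numpy as np
from scipy.integrate import quad

def stp(S_f, B_d, varsigma1, varsigma2, mu_ij):
    """Numerical evaluation of the STP integral derived above (a sketch)."""
    def integrand(tau):
        if tau <= 0.0:
            return 0.0
        expo = S_f / (tau * B_d)
        if expo > 700.0:                 # 2**expo would overflow; the integrand is ~0 there
            return 0.0
        thr = 2.0 ** expo - 1.0          # SNR threshold for delivering S_f within tau seconds
        return (varsigma1 / (mu_ij * (varsigma1 + varsigma2 * thr))
                * np.exp(-(thr / varsigma1 + tau / mu_ij)))
    value, _ = quad(integrand, 0.0, np.inf)
    return value

# Example with illustrative (not paper-specified) parameter values
print(stp(S_f=2e6, B_d=1e6, varsigma1=50.0, varsigma2=5.0, mu_ij=10.0))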
(2) UAV communication model: Due to the altitude of the UAV and the complexity of the environment, the wireless channels between the UAV and GUs, as well as between the MBS and the UAV, are subject to probabilistic line-of-sight (LoS) and non-line-of-sight (NLoS) conditions [30]. The path loss from the UAV to RU j can be described by both LoS and NLoS models as follows [31]:
$PL_{\mathrm{LoS},u,j}(t) = 20\log\left(\dfrac{4\pi f_c d_{u,j}(t)}{c}\right) + \eta_{\mathrm{LoS}},$
$PL_{\mathrm{NLoS},u,j}(t) = 20\log\left(\dfrac{4\pi f_c d_{u,j}(t)}{c}\right) + \eta_{\mathrm{NLoS}},$
where $d_{u,j}(t)$ represents the distance between the UAV and RU j at time slot t. The carrier frequency is denoted by $f_c$, and $c$ represents the speed of light. Additionally, $\eta_{\mathrm{LoS}}$ and $\eta_{\mathrm{NLoS}}$ represent the additional losses for the LoS and NLoS scenarios, respectively.
The probability of a LoS link between the UAV and RU j is determined by their elevation angles and environmental factors [17] and can be expressed as
$\Pr_{\mathrm{LoS},u,j}(t) = \dfrac{1}{1 + c_1 \exp\left(-c_2\left(\theta_{u,j}(t) - c_1\right)\right)},$
where $\theta_{u,j}(t)$ represents the elevation angle between the UAV and RU j at time slot t, and $c_1$ and $c_2$ are environment-dependent constants. Then, the NLoS probability is $\Pr_{\mathrm{NLoS},u,j}(t) = 1 - \Pr_{\mathrm{LoS},u,j}(t)$. Therefore, the average path loss from the UAV to RU j is
$PL_{u,j}(t) = \Pr_{\mathrm{LoS},u,j}(t) \times PL_{\mathrm{LoS},u,j}(t) + \Pr_{\mathrm{NLoS},u,j}(t) \times PL_{\mathrm{NLoS},u,j}(t).$
If the distance is sufficiently large, interference can be neglected in millimeter-wave UAV networks [13]. During time slot t, the data rate from the UAV to RU j is given by
$r_{u,j}(t) = B_u \log_2\left(1 + \dfrac{P_u |g_{u,j}|^2}{PL_{u,j}(t) N_0}\right),$
where $B_u$ denotes the bandwidth of the UAV. $P_u$ denotes the transmission power of the UAV. $|g_{u,j}(t)|^2$ represents the Rayleigh channel gain from the UAV to RU j and follows an exponential distribution with unit mean [31].
During time slot t, the data rate from the MBS to the UAV is
$r_{m,u}(t) = B_m \log_2\left(1 + \dfrac{P_m |g_{m,u}|^2}{PL_{m,u}(t) N_0}\right),$
where $B_m$ represents the bandwidth of the MBS. $P_m$ represents the transmission power of the MBS. The small-scale fading between the MBS and the UAV is represented by $|g_{m,u}(t)|^2$, while the path loss in this link is indicated by $PL_{m,u}(t)$.
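The following sketch assembles the LoS/NLoS path-loss model and the resulting UAV-to-RU rate. It assumes base-10 logarithms in the path-loss expressions (i.e., path loss in dB) and an elevation angle in degrees for the LoS-probability constants, which is the common convention for this air-to-ground model; both conventions and all numeric values are assumptions rather than values stated here.

import numpy as np

def uav_rate(d_uj, theta_deg, f_c, eta_los, eta_nlos, c1, c2, P_u, B_u, N0, g2=1.0):
    """Average path loss and UAV->RU data rate (sketch; dB path loss, degrees for theta)."""
    c_light = 3.0e8
    fspl_db = 20.0 * np.log10(4.0 * np.pi * f_c * d_uj / c_light)   # free-space term
    p_los = 1.0 / (1.0 + c1 * np.exp(-c2 * (theta_deg - c1)))       # LoS probability
    pl_db = p_los * (fspl_db + eta_los) + (1.0 - p_los) * (fspl_db + eta_nlos)
    pl_lin = 10.0 ** (pl_db / 10.0)                                 # dB -> linear
    return B_u * np.log2(1.0 + P_u * g2 / (pl_lin * N0))

# Example with illustrative values: 2 GHz carrier, 150 m link, 45-degree elevation
rate = uav_rate(d_uj=150.0, theta_deg=45.0, f_c=2e9, eta_los=1.0, eta_nlos=20.0,
                c1=10.0, c2=0.6, P_u=1.0, B_u=1e6, N0=1e-13)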

2.3. Transmission Delay Model

The cache placement for CU i and the UAV at time slot t is represented by the vectors $\mathbf{c}_i(t)$ and $\mathbf{c}_u(t)$, respectively. If a file is cached at a CU or the UAV, the corresponding entry in the cache placement vector is set to 1; otherwise, it is 0. When RU j requests file f at time slot t, it first checks whether any CU can establish a D2D link. The set of CUs that meet the D2D link conditions is denoted as $\mathcal{G}_{j,f}(t) = \{i \mid \Pr_{i,j}(t) \ge v_{\min}, c_{i,f}(t) = 1, d_{i,j}(t) \le R_d, i \in \mathcal{I}\}$. If no CU satisfies these conditions, the file is requested from the UAV.
The downlink wireless transmission delay from CU i to RU j for transmitting file f is
$T_{i,j,f}(t) = \dfrac{S_f}{r_{i,j}(t)}.$
For UAV communication, the transmission delay consists of two components: the downlink wireless delay and the backhaul link delay [2]. The downlink wireless delay from the UAV to RU j is
$T_{1,u,j,f}(t) = \dfrac{S_f}{r_{u,j}(t)},$
while the backhaul link delay from the MBS to the UAV for sending file f is
$T_{2,u,j,f}(t) = \dfrac{S_f \cdot \left(1 - c_{u,f}(t)\right)}{r_{m,u}(t)}.$
If the content requested by RU j is already stored on the UAV, the backhaul link is not required; that is, when $c_{u,f}(t) = 1$, $T_{2,u,j,f}(t) = 0$.
Thus, the overall transmission delay for RU j in time slot t is
$T_j(t) = \sum_{f \in \mathcal{F}} \left[ I_{j,f}(t) \cdot \min_{i \in \mathcal{G}_{j,f}} T_{i,j,f}(t) + \left(1 - I_{j,f}(t)\right) \cdot T_{u,j,f}(t) \right],$
where $I_{j,f}(t) = 1$ indicates that file f is available from CUs via D2D, while $I_{j,f}(t) = 0$ means the file is requested from the UAV, resulting in $T_{u,j,f}(t) = T_{1,u,j,f}(t) + T_{2,u,j,f}(t)$.
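The per-RU delay aggregation above can be written directly as a small routine; the dictionaries below are hypothetical containers that would be filled from the rate and cache-placement quantities defined earlier.

def ru_delay(requested_files, d2d_candidates, T_d2d, T1_uav, T2_uav):
    """Per-slot delay of one RU, following the delay model above (sketch).

    requested_files: iterable of file ids f requested by the RU this slot.
    d2d_candidates:  dict f -> list of CUs i satisfying the D2D conditions (the set G_{j,f}).
    T_d2d:           dict (i, f) -> D2D downlink delay S_f / r_{i,j}.
    T1_uav, T2_uav:  dict f -> UAV downlink delay and backhaul delay (0 if cached at the UAV).
    """
    total = 0.0
    for f in requested_files:
        cands = d2d_candidates.get(f, [])
        if cands:                                        # I_{j,f} = 1: best D2D link serves the file
            total += min(T_d2d[(i, f)] for i in cands)
        else:                                            # I_{j,f} = 0: UAV (plus backhaul if needed)
            total += T1_uav[f] + T2_uav[f]
    return total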

2.4. UAV Energy Consumption Model

In this work, the energy consumption of the UAV includes two components: the energy required for propulsion during movement and the energy needed for content transmission.
The power required for a rotary-wing UAV to travel at a speed V is [32]:
$P(V) = P_0\left(1 + \dfrac{3V^2}{U^2}\right) + P_1\left(\sqrt{1 + \dfrac{V^4}{4 v_r^4}} - \dfrac{V^2}{2 v_r^2}\right)^{1/2} + \dfrac{1}{2} d_r \rho s A V^3,$
where $P_0$ and $P_1$ are constants associated with the blade profile power and the induced power of the UAV while in a hovering state. $U$ signifies the rotor blade tip speed, $v_r$ is the average rotor-induced velocity in hover, $d_r$ indicates the fuselage drag ratio, and $\rho$ denotes air density. Additionally, $s$ and $A$ correspond to the rotor solidity and rotor disc area, respectively [25]. The power required for hovering is $P_h = P(V=0) = P_0 + P_1$.
When the UAV travels at a constant speed V, the energy used for propulsion in one time slot can be represented as
$E(x) = \dfrac{x}{V} P(V) + \max\left(\delta_t - \dfrac{x}{V}, 0\right) \cdot P_h,$
where $x = \|\mathbf{q}(t+1) - \mathbf{q}(t)\|$ represents the distance covered by the UAV in one time slot.
The communication energy consumption from the UAV to RUs for transmitting file f is given by [23]
$E_u(t) = \sum_{j=1}^{J} \sum_{f=1}^{F} P_u \cdot \min\left(\delta_t, T_{u,j,f}(t)\right).$
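A short numerical sketch of the propulsion-energy model is given below; the power expression follows the standard rotary-wing model reconstructed above, and the rotor parameters in the example are illustrative placeholders rather than the values in Table 3.

import numpy as np

def propulsion_power(V, P0, P1, U_tip, v_r, d_r, rho, s, A):
    """Rotary-wing propulsion power P(V): blade profile + induced + parasite terms."""
    blade = P0 * (1.0 + 3.0 * V**2 / U_tip**2)
    induced = P1 * np.sqrt(np.sqrt(1.0 + V**4 / (4.0 * v_r**4)) - V**2 / (2.0 * v_r**2))
    parasite = 0.5 * d_r * rho * s * A * V**3
    return blade + induced + parasite

def slot_propulsion_energy(x, V, delta_t, P0, P1, **rotor):
    """Energy in one slot: fly x meters at speed V, hover for the remaining time."""
    P_h = P0 + P1                                        # hovering power P(V = 0)
    return (x / V) * propulsion_power(V, P0, P1, **rotor) + max(delta_t - x / V, 0.0) * P_h

# Example with illustrative rotor parameters
rotor = dict(U_tip=120.0, v_r=4.03, d_r=0.6, rho=1.225, s=0.05, A=0.503)
E = slot_propulsion_energy(x=10.0, V=10.0, delta_t=2.0, P0=79.86, P1=88.63, **rotor)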

2.5. Cache Hit Probability Model

In time slot t, the probability that RU j obtains file f from the CUs is
$H_{d,j,f}(t) = req_{j,f}(t) \cdot \dfrac{\sum_{i \in \mathcal{G}_{j,f}} c_{i,f}(t) \cdot \Pr_{i,j}(t)}{\sum_{i \in \mathcal{G}_{j,f}} c_{i,f}(t)}. \quad (20)$
The total probability that RUs can retrieve the file from the CUs is
$H_d(t) = \dfrac{\sum_{j=1}^{J}\sum_{f=1}^{F} I_{j,f}(t) \cdot H_{d,j,f}(t)}{\sum_{j=1}^{J}\sum_{f=1}^{F} req_{j,f}(t)}. \quad (21)$
In time slot t, the probability that RU j acquires file f from the UAV is
$H_{u,j,f}(t) = req_{j,f}(t) \cdot c_{u,f}(t). \quad (22)$
The total probability that RUs can retrieve the file from the UAV is
$H_u(t) = \dfrac{\sum_{j=1}^{J}\sum_{f=1}^{F} \left(1 - I_{j,f}(t)\right) \cdot H_{u,j,f}(t)}{\sum_{j=1}^{J}\sum_{f=1}^{F} req_{j,f}(t)}. \quad (23)$
The total CHP at time slot t is defined as the probability that the content requested by RU is present in either the CU cache or the UAV cache, expressed as
$H(t) = 1 - \left(1 - H_d(t)\right) \cdot \left(1 - H_u(t)\right). \quad (24)$
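The hit-probability bookkeeping above reduces to a few array operations; the sketch below assumes the (J, F) arrays follow the definitions in this subsection.

import numpy as np

def total_chp(req, I_jf, H_d_jf, H_u_jf):
    """Total CHP H(t) from the per-request hit probabilities (a sketch)."""
    denom = req.sum()
    H_d = (I_jf * H_d_jf).sum() / denom            # probability of a hit via D2D, as in (21)
    H_u = ((1 - I_jf) * H_u_jf).sum() / denom      # probability of a hit via the UAV, as in (23)
    return 1.0 - (1.0 - H_d) * (1.0 - H_u)         # total CHP, as in (24)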

3. Problem Formulation and Optimization

3.1. Problem Formulation

We propose a joint optimization framework to maximize the overall CHP by simultaneously optimizing the UAV trajectory $\mathbf{q}(t)$, the CUs’ transmission power $\mathbf{p}(t) = [P_1(t), \ldots, P_I(t)]$, the CUs’ cache placement $\mathbf{C}_d(t) = [\mathbf{c}_1(t), \ldots, \mathbf{c}_I(t)]$, and the UAV’s cache placement $\mathbf{c}_u(t)$. The optimization problem is defined as follows:
$\mathcal{P}1: \max_{\mathbf{q}(t), \mathbf{p}(t), \mathbf{C}_d(t), \mathbf{c}_u(t)} \ \dfrac{1}{N}\sum_{t=1}^{N} H(t) \quad (25a)$
$\text{s.t.} \quad \sum_{f=1}^{F} c_{i,f}(t) \cdot S_f \le S_i, \ \forall i, t, \quad (25b)$
$\sum_{f=1}^{F} c_{u,f}(t) \cdot S_f \le S_u, \ \forall t, \quad (25c)$
$0 \le \theta_u(t) \le 2\pi, \quad (25d)$
$0 \le x(t) \le x_{\max}, \quad (25e)$
$0 \le y(t) \le y_{\max}, \quad (25f)$
$0 \le d_u(t) \le d_{\max}, \quad (25g)$
$\mathbf{q}(1) = \mathbf{q}(N), \quad (25h)$
$\sum_{t=1}^{N} \left[ E\left(\|\mathbf{q}(t) - \mathbf{q}(t-1)\|\right) + E_u(t) \right] \le E_{u,\max}, \quad (25i)$
$P_i(t) \le P_{\max}, \ \forall i, t, \quad (25j)$
$\dfrac{1}{N}\sum_{t=1}^{N} P_i(t) \le \bar{P}, \ \forall i, \quad (25k)$
$T_j(t) \le \delta_t, \ \forall j, t, \quad (25l)$
where $E_{u,\max}$ denotes the total onboard energy available to the UAV. Constraints (25b) and (25c) limit the maximum cached size for the CUs and UAV, respectively. Constraints (25d)–(25h) restrict the UAV trajectory. Constraint (25i) caps the total amount of energy that the UAV is allowed to use. Constraints (25j) and (25k) govern the power allocation of the CUs. Constraint (25l) states that the total transmission delay within each time slot must not exceed the duration of the time slot.
$\mathcal{P}1$ is a non-convex mixed-integer programming problem and is NP-hard [33]. Typically, a brute-force method is employed to identify the optimal solution. However, this approach is computationally intensive and impractical for large-scale systems [10]. Additionally, obtaining the optimal solution for the next time slot requires full knowledge of future information, meaning performance may be compromised without it. Therefore, we consider a DRL-based approach to estimate the optimal solution when this information is unavailable.

3.2. Partially Observable Markov Game

Due to the optimization goal of maximizing the total CHP, we convert $\mathcal{P}1$ into a partially observable Markov game involving $I+1$ agents, comprising $I$ CU agents and 1 UAV agent. This can be described by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$, where $\mathcal{S}$ denotes the collection of states defining the environment; $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_{I+1}\}$ denotes the joint action set encompassing all agents; $\mathcal{R} = \{\mathcal{R}_1, \ldots, \mathcal{R}_{I+1}\}$ denotes the set of reward functions that assigns rewards according to states and the actions of the agents, i.e., $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{I+1}$; $\mathcal{P}$ denotes the state transition function, which specifies the probability of moving to the subsequent state given the joint actions taken by all agents in the current state, i.e., $\mathcal{P}: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_{I+1} \times \mathcal{S} \to [0,1]$; and $\gamma$ represents the factor for discounting future rewards.
During each time slot t, the agents interact with the environment, whose state is represented by $s_t \in \mathcal{S}$. Consequently, agent k (we use k as the index of agents in the entire heterogeneous system, where $k \in \{1, \ldots, I+1\}$; when $k = I+1$, the agent corresponds to the UAV; otherwise, the agent is a CU) receives local observations of the environment, denoted as $o_{k,t}$, determined by the observation function $o_{k,t} = b_k(s_t)$. Subsequently, agent k takes actions $a_{k,t} = \pi_k(o_{k,t})$, where $\pi_k(\cdot)$ represents the strategy function of agent k. Let $o_t$ represent the set of observations $\{o_{1,t}, \ldots, o_{I+1,t}\}$, and $a_t$ represent the set of actions $\{a_{1,t}, \ldots, a_{I+1,t}\}$. As illustrated in Figure 2, each agent obtains local observations and selects actions according to its policy. After taking the joint action $a_t$, the agents obtain the associated rewards $r_t = \{r_{1,t}, \ldots, r_{I+1,t}\}$ according to the reward function $\mathcal{R}$. Furthermore, based on the state transition probability function $\mathcal{P}$, the joint observation updates from $o_t$ to $o_{t+1}$.

3.3. Markov Decision Process

The SH-MAMFAC framework involves two types of heterogeneous agents. The first type comprises the homogeneous CUs.
(1) Observation space: The observation space for the CU agent includes the transmission power and caching decisions of the CU from the previous time slot, as well as its own position, the positions of all RUs (RUs equipped with GPS can report their positions to the MBS; the MBS then distributes this location information to all CUs [17]), and the request distribution for each RU during the present time slot. Therefore, the observation space for the CU is
$o_{i,t} = \left\{P_i(t-1), \mathbf{c}_i(t-1), \mathbf{w}_i(t), \{\mathbf{w}_j(t)\}_{j \in \mathcal{J}}, \{p_{j,f}(t)\}_{j \in \mathcal{J}, f \in \mathcal{F}}\right\}.$
(2) Action space: In our scenario, the action space $a_{i,t}$ is a combination of discrete and continuous components. It includes discrete sub-actions related to the caching decisions $\mathbf{c}_i(t)$ and continuous sub-actions for the transmission power $P_i(t)$. In the SH-MAMFAC framework, the actor network produces a normalized continuous action $a_{\text{con},i,t} = \{\mathbf{c}_{i,t}, P_{i,t}\}$, which requires further processing before it can be applied in the environment. Therefore, we design a mixed-action mapping strategy to address this issue. First, we transform the output $\mathbf{c}_{i,t}$ into a probability distribution $\hat{\mathbf{c}}_{i,t} = \hat{\pi}_i(o_{i,t}; \hat{\Theta}_{\pi,i})$, where the fth element of $\hat{\mathbf{c}}_{i,t}$ represents the probability of CU i caching file f. $\hat{\pi}_i$ is the actor network responsible for outputting the probabilities. We then sample from this probability distribution to obtain the actual cache decision $\mathbf{c}_i(t) = \psi(\hat{\mathbf{c}}_{i,t})$, with $\psi(\cdot)$ representing the sampling process. This approach effectively maps the generated cache actions to the discrete action space [8]. Additionally, $P_{i,t}$ is mapped to a real power value by multiplying it by its maximum value $P_{\max}$. Thus, the hybrid action space is handled effectively.
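The sketch below shows one plausible realization of this mapping for a CU agent: a softmax turns the cache scores into the distribution $\hat{\mathbf{c}}_{i,t}$, a multinomial draw plays the role of $\psi(\cdot)$ under the cache-capacity budget, and the power entry is rescaled by $P_{\max}$. The capacity-aware sampling is an assumption about how $\psi(\cdot)$ is realized, not a detail given in the text.

import torch

def map_cu_action(a_con, p_max, s_i, s_f):
    """Mixed-action mapping for a CU (sketch): discrete cache decision + continuous power.

    a_con: actor output in [0, 1] of length F + 1 (F cache scores, then one power score).
    """
    scores, p_norm = a_con[:-1], a_con[-1]
    c_hat = torch.softmax(scores, dim=-1)                 # probability over the F files
    k = int(s_i // s_f)                                   # number of files the cache can hold
    idx = torch.multinomial(c_hat, k, replacement=False)  # psi(.): sample the cached files
    c = torch.zeros_like(c_hat)
    c[idx] = 1.0                                          # binary cache placement vector
    return c, p_norm * p_max                              # de-normalized transmission power

# Example: F = 30 files, 10 Mbit cache, 2 Mbit files, and a hypothetical 2 W peak power
c, p = map_cu_action(torch.rand(31), p_max=2.0, s_i=10.0, s_f=2.0)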
The second category of agents comprises the UAV, whose observation and action spaces differ from those of the CUs:
(1) Observation space: The observation space for the UAV agent includes the UAV caching decisions from the previous time slot, as well as its own position, the positions of all RUs, and the request distribution for each RU during the present time slot. Therefore, the observation space for the UAV is
$o_{I+1}(t) = \left\{\mathbf{c}_u(t-1), \mathbf{q}(t), \{\mathbf{w}_j(t)\}_{j \in \mathcal{J}}, \{p_{j,f}(t)\}_{j \in \mathcal{J}, f \in \mathcal{F}}\right\}.$
(2) Action space: Similar to the action space of the CU agents, the action space of the UAV agent $a_{I+1,t}$ includes an integer sub-action for the caching decision $\mathbf{c}_u(t)$, as well as continuous sub-actions $\theta_u(t)$ and $d_u(t)$. We transform the normalized continuous output $\mathbf{c}_{u,t}$ into a probability distribution $\hat{\mathbf{c}}_{u,t}$ and then sample from it to obtain the actual cache decision $\mathbf{c}_u(t) = \psi(\hat{\mathbf{c}}_{u,t})$. The continuous sub-actions $\theta_{u,t}$ and $d_{u,t}$ are multiplied by their maximum values $2\pi$ and $d_{\max}$, respectively, to obtain the values applied in the actual operation.
Rewards guide each agent toward its optimal policy, and when the reward signals at each step are aligned with the desired objectives, system performance can be enhanced [34]. Given that shaped rewards can accelerate learning compared to sparse rewards, we design the reward element $r_{i,t+1} = \dfrac{\sum_{j=1}^{J}\sum_{f=1}^{F} I_{j,f}(t) \cdot c_{i,f}(t) \cdot \Pr_{i,j}(t) \cdot req_{j,f}(t)}{\sum_{j=1}^{J}\sum_{f=1}^{F} req_{j,f}(t)}$ for CU i, which can be derived from (20) and (21). Similarly, based on (23), the reward element for the UAV is $r_{I+1,t+1} = H_u(t)$. Given the shared goal of the optimization problem, all $I+1$ agents must work together to optimize the total CHP. To achieve a cooperative Markovian game, we assume that each agent receives the same immediate reward [18,34], i.e., $r_{1,t} = \cdots = r_{I+1,t} = r_t$. To reduce significant fluctuations in rewards, we define $r_t = \frac{1}{J}\sum_{j=1}^{J} r_{j,t}$, where $r_{j,t}$ represents the reward obtained by RU j during time slot t and is defined as
$r_j(t) = \begin{cases} \sum_{f=1}^{F} H_{d,j,f}(t), & \text{if } I_{j,f}(t) = 1, \\ \sum_{f=1}^{F} H_{u,j,f}(t), & \text{if } I_{j,f}(t) = 0. \end{cases}$
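The shared cooperative reward therefore amounts to averaging the per-RU hit terms. A minimal sketch, assuming the same (J, F) arrays as in Section 2.5:

import numpy as np

def shared_reward(I_jf, H_d_jf, H_u_jf):
    """Cooperative reward r_t = (1/J) * sum_j r_j(t), shared by all I + 1 agents (sketch)."""
    r_j = (I_jf * H_d_jf + (1 - I_jf) * H_u_jf).sum(axis=1)   # per-RU reward r_j(t)
    return float(r_j.mean())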

3.4. Scalable Heterogeneous Multi-Agent Mean-Field Actor–Critic

As illustrated in Figure 3, each agent utilizes an actor network $\pi_k(o_k; \Theta_{\pi,k})$ with weights $\Theta_{\pi,k}$ for decentralized execution. The actor network processes local observations to generate actions as outputs. In conventional MADRL, the action–value function, i.e., the Q-function, calculates the expected discounted rewards for given states and the actions taken by the agents. Agents optimize their behavior by learning these Q-functions. However, learning Q-functions becomes exceedingly difficult when the dimension of the joint action space grows significantly [21,35]. To tackle scalability challenges in large-scale MADRL, we approximate the standard Q-function of agent k with a mean-field Q-function as [36]
$Q_k(o, a) = \dfrac{1}{|\mathcal{K}_k^1|}\sum_{k_1 \in \mathcal{K}_k^1} Q_k\left(o, a_k, a_{1,k_1}\right) + \dfrac{1}{|\mathcal{K}_k^2|}\sum_{k_2 \in \mathcal{K}_k^2} Q_k\left(o, a_k, a_{2,k_2}\right), \quad (29)$
where $k_1 \in \mathcal{K}_k^1$ and $k_2 \in \mathcal{K}_k^2$ index the neighboring CU agents and UAV agents (in this scenario, we consider a single UAV; however, this approach can be easily extended to multiple UAVs, which will be covered in our future work) associated with agent k, respectively. $\mathcal{K}_k = \mathcal{K}_k^1 \cup \mathcal{K}_k^2$ represents the set of neighboring agents of agent k, determined by the specific application settings. Equation (29) can be rewritten as
$Q_k(o, a) = \dfrac{1}{|\mathcal{K}_k|}\sum_{k' \in \mathcal{K}_k} Q_k\left(o, a_k, a_{1,k_1}, a_{2,k_2}\right). \quad (30)$
It is important to note that the pairwise approximation between an agent and its connected neighbors preserves the global interaction between any two agents. Mean-field theory simplifies the interactions of multiple agents to those of two agents, where the second agent represents the average effect of all other agents [37]. Then, Theorem 1 is founded for the Q-function:
Theorem 1.
The Q-function $Q_k(o, a)$ of agent k can be approximated using the mean action $\bar{a}_{l,k}$ and the action $a_k$, i.e.,
$Q_k(o, a) = \dfrac{1}{|\mathcal{K}_k|}\sum_{k' \in \mathcal{K}_k} Q_k\left(o, a_k, a_{1,k_1}, a_{2,k_2}\right) \approx Q_k\left(o, a_k, \bar{a}_{l,k}\right), \quad (31)$
where $\bar{a}_{l,k}$ ($l \in \{1, 2\}$) denotes the mean action of the agents belonging to class l in the neighborhood of agent k.
Proof. 
The proof is presented in Appendix D.    □
The mean-field Q-function can be updated recursively as follows:
$Q_{k,t+1}\left(o, a_k, \bar{a}_{l,k}\right) = \left(1 - \alpha_t\right) Q_{k,t}\left(o, a_k, \bar{a}_{l,k}\right) + \alpha_t\left[r_k + \gamma v_{k,t}\left(o'\right)\right], \quad (32)$
where $\alpha_t$ is the learning rate. The value function $v_{k,t}(o')$ of agent k at time t is defined as
$v_{k,t}\left(o'\right) = \sum_{a_k} \pi_{k,t}\left(a_k \mid o', \bar{a}_{l,k}\right) \, \mathbb{E}_{a_{l,k} \sim \pi_{k,t}}\left[Q_{k,t}\left(o', a_k, \bar{a}_{l,k}\right)\right]. \quad (33)$
We compute the mean actions of heterogeneous agents using the following equation:
$\bar{a}_{l,k} = \dfrac{1}{|\mathcal{K}_k^l|}\sum_{k_l \in \mathcal{K}_k^l} a_{l,k_l}, \quad a_{l,k_l} \sim \pi_{k_l,t}\left(\cdot \mid o, \bar{a}_{l,k_l}\right), \quad (34)$
where $\pi_{k_l,t}$ denotes the policy of agent $k_l$ (in the neighborhood of k), and $\bar{a}_{l,k_l}$ represents the previous mean action of the neighbors of class l for agent k. Then, the Boltzmann policy for each agent k is
$\pi_{k,t}\left(a_k \mid o, \bar{a}_{l,k}\right) = \dfrac{\exp\left(\beta Q_{k,t}\left(o, a_k, \bar{a}_{l,k}\right)\right)}{\sum_{a_k' \in \mathcal{A}_k} \exp\left(\beta Q_{k,t}\left(o, a_k', \bar{a}_{l,k}\right)\right)}, \quad (35)$
where β denotes the Boltzmann softmax parameter.
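To make the recursion in (32)–(35) concrete, the sketch below implements one tabular update for a single agent, with the observation, action, and a discretized mean action used as table indices; this is an illustration only, since the actual SH-MAMFAC framework replaces the table with the neural critics described next.

import numpy as np

def boltzmann(q_row, beta):
    """Boltzmann policy over a discrete action set, as in (35)."""
    z = np.exp(beta * (q_row - q_row.max()))     # subtract the max for numerical stability
    return z / z.sum()

def mean_field_q_update(Q, o, a, a_bar, r, o_next, a_bar_next, alpha, gamma, beta):
    """One tabular mean-field update following (32)-(33); Q has shape (obs, actions, mean-bins)."""
    pi_next = boltzmann(Q[o_next, :, a_bar_next], beta)
    v_next = float(np.dot(pi_next, Q[o_next, :, a_bar_next]))    # value v_{k,t}(o')
    Q[o, a, a_bar] = (1 - alpha) * Q[o, a, a_bar] + alpha * (r + gamma * v_next)
    return Q

# Example: 4 observations, 3 actions, 5 mean-action bins (all sizes hypothetical)
Q = np.zeros((4, 3, 5))
Q = mean_field_q_update(Q, o=0, a=1, a_bar=2, r=0.7, o_next=1, a_bar_next=2,
                        alpha=0.1, gamma=0.95, beta=1.0)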
By iterating Equations (33)–(35), the mean actions of all agents and their corresponding policies are continuously improved. In the proof of Theorem 2, we demonstrate that this method converges to a fixed point, which is boundedly close to the Nash equilibrium. We first present three assumptions regarding mean-field updates, followed by Theorem 2.
Assumption 1.
Each action–value pair is visited infinitely often, while the rewards are bounded by a constant.
Assumption 2.
The agent’s policy is known as Greedy in the Limit with Infinite Exploration (GLIE). With the Boltzmann policy, as the temperature decreases, the policy becomes greedy with respect to the Q-function in the limit.
Assumption 3.
For each stage game $[Q_{1,t}(o), \ldots, Q_{I+1,t}(o)]$ at time t and in state s, and for all $t$, $s$, and $k \in \{1, \ldots, I+1\}$, the Nash equilibrium $\pi^* = [\pi_1^*, \ldots, \pi_{I+1}^*]$ is regarded as a global optimum or a saddle point, as shown below:
1. $\mathbb{E}_{\pi^*}\left[Q_{k,t}(o)\right] \ge \mathbb{E}_{\pi}\left[Q_{k,t}(o)\right], \ \forall \pi \in \Omega\left(\prod_{k_l} \mathcal{A}_{k_l}\right);$
2. $\mathbb{E}_{\pi^*}\left[Q_{k,t}(o)\right] \ge \mathbb{E}_{\pi_k}\mathbb{E}_{\pi_{-k}^*}\left[Q_{k,t}(o)\right], \ \forall \pi_k \in \Omega\left(\mathcal{A}_k\right);$
3. $\mathbb{E}_{\pi^*}\left[Q_{k,t}(o)\right] \le \mathbb{E}_{\pi_k^*}\mathbb{E}_{\pi_{-k}}\left[Q_{k,t}(o)\right], \ \forall \pi_{-k} \in \Omega\left(\prod_{k_l \ne k} \mathcal{A}_{k_l}\right).$
Theorem 2.
Based on (32)–(35), under the conditions of Assumptions 1–3, updating the mean-field Q-function will converge to a fixed point that is boundedly close to the Nash equilibrium.
Proof. 
The proof is presented in Appendix E.    □
During the centralized training process, each agent stores the experience $\left(o, a, r, o', \bar{a}_l\right)$ in a replay buffer $\mathcal{M}_k$ and randomly samples mini-batches from it to update its parameters. This process ensures system stability and improves sample efficiency. Consider the m-th transition as an example; the critic network samples a batch of $B_k$ transitions from $\mathcal{M}_k$ and minimizes the loss function to adjust the parameters of the evaluation network using the learning rate $\rho_Q$:
$L\left(\Theta_{Q,k}\right) = \dfrac{1}{B_k}\sum_m \left[y_k^m - Q_k\left(o^m, a_k^m, \bar{a}_{l,k}^m; \Theta_{Q,k}\right)\right]^2, \quad (39)$
where $y_k^m = r_k^m + \gamma Q_k'\left(o'^m, a_k^m, \bar{a}_{l,k}^m; \Theta_{Q,k}'\right)$ represents the estimated target value computed by the target critic network. The parameters of the actor network are updated using the learning rate $\varrho_\pi$ by maximizing the policy objective function:
$\nabla_{\Theta_{\pi,k}} J\left(\Theta_{\pi,k}\right) \approx \dfrac{1}{B_k}\sum_m \nabla_{\Theta_{\pi,k}} \log \pi_k\left(o^m; \Theta_{\pi,k}\right)\left[y_k^m - Q_k\left(o^m, a_k^m, \bar{a}_{l,k}^m; \Theta_{Q,k}\right)\right]. \quad (40)$
As $\Theta_{\pi,k}$ and $\Theta_{Q,k}$ are updated in real time, the parameters of the target networks $\Theta_{\pi,k}'$ and $\Theta_{Q,k}'$ are also smoothly updated as follows:
$\Theta_{\pi,k}' = \kappa \Theta_{\pi,k} + (1 - \kappa)\Theta_{\pi,k}', \quad (41a)$
$\Theta_{Q,k}' = \kappa \Theta_{Q,k} + (1 - \kappa)\Theta_{Q,k}', \quad (41b)$
where $\kappa$ is the update rate.
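A compact PyTorch sketch of these three updates is shown below. It assumes the critic takes $(o, a_k, \bar{a}_{l,k})$ tensors, the actor returns a torch.distributions object, and the next-step actions in the batch were produced by the target actor; these interface details are assumptions, not specifications from the paper.

import torch
import torch.nn.functional as F

def update_agent(actor, actor_tgt, critic, critic_tgt, batch, gamma, kappa,
                 actor_opt, critic_opt):
    """Critic loss (39), actor objective (40), and soft target updates (41) -- a sketch."""
    o, a_k, a_bar, r, o_next, a_k_next, a_bar_next = batch

    # Critic update: regress Q towards the target y computed with the target critic
    with torch.no_grad():
        y = r + gamma * critic_tgt(o_next, a_k_next, a_bar_next)
    critic_loss = F.mse_loss(critic(o, a_k, a_bar), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: log-probability weighted by the advantage-like term (y - Q)
    logp = actor(o).log_prob(a_k).sum(dim=-1)
    weight = (y - critic(o, a_k, a_bar)).squeeze(-1).detach()
    actor_loss = -(logp * weight).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of both target networks with rate kappa
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - kappa).add_(p.data, alpha=kappa)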
As depicted in Figure 3, the SH-MAMFAC framework comprises an aerial–ground collaborative environment along with $I+1$ agents. During the centralized offline training phase, agent k is provided with the observations and actions of all other agents in addition to its own local observations. The actor network of CU i processes the local observation $o_{i,t}$ and generates a normalized continuous action $a_{\text{con},i,t}$. The UAV provides its observation $o_{I+1,t}$ to the actor network $\pi_{I+1}(o_{I+1}; \Theta_{\pi,I+1})$, which produces the normalized continuous action $a_{\text{con},I+1,t}$. Exploration noise $\mathcal{N}$ is added to encourage exploration and prevent policies from becoming trapped in local optima. The noise follows a normal distribution with a mean of zero and a standard deviation of $\sigma_{\mathcal{N}}$. As the agents carry out their joint action $a_t$, the environment progresses to the next state $s_{t+1}$, and the agents receive rewards $r_t$ and updated observations $o_{t+1}$.
If the UAV breaches trajectory or energy constraints, it faces a penalty in its reward, and the affected flight may be terminated. In the considered scenario, the total CHP is jointly determined by the actions of I + 1 agents. Learning the Q-function with observations that include information about the actions of other agents aids in training. Since each agent can perceive the actions of other agents during the offline training phase, the environment is considered stable for each agent, addressing the challenges posed by a dynamic environment. In the execution phase, the actor network relies solely on local observations, allowing each agent to independently determine its actions without needing information about other agents. The training procedure for the proposed algorithm is outlined in Algorithm 1.
Algorithm 1 SH-MAMFAC Algorithm
1: Input: Structures of the actor and critic networks for each agent k ($k \in \mathcal{K} = \{1, \ldots, I+1\}$), buffer capacity $\mathcal{M}$, batch size $\mathcal{B}$, and episode duration $\hat{T}$.
2: Output: Optimally trained parameters of the actor networks for all agents.
3: Initialize actor networks $\{\Theta_{\pi,k}\}_{k \in \mathcal{K}}$, target actor networks $\{\Theta_{\pi,k}'\}_{k \in \mathcal{K}}$, critic networks $\{\Theta_{Q,k}\}_{k \in \mathcal{K}}$, and target critic networks $\{\Theta_{Q,k}'\}_{k \in \mathcal{K}}$.
4: Set up the experience replay buffer.
5: for each episode do
6:     Establish the initial positions of the UAV and GUs.
7:     for each time slot t do
8:         for each agent k do
9:             Obtain observation $o_{k,t}$.
10:            Choose action $a_{k,t} = \pi_k(o_{k,t}; \Theta_{\pi,k}) + \mathcal{N}$.
11:        end for
12:        Each agent performs its actions.
13:        if the UAV violates trajectory or energy constraints then
14:            Recalibrate the UAV position.
15:        end if
16:        Agents receive rewards $r_t$ and updated observations $o_{t+1}$.
17:        Save $(o_t, a_t, r_t, o_{t+1}, \bar{a}_{l,t})$ in $\mathcal{M}$.
18:    end for
19:    if the experience replay memory is full then
20:        Draw mini-batches of size $\mathcal{B}$ from the buffer.
21:        Adjust the critic network parameters by minimizing the loss in (39).
22:        Optimize the actor network by maximizing the policy objective in (40).
23:        Update the target network parameters in (41).
24:    end if
25: end for

3.5. Computational Complexity Analysis

The computational complexity of the SH-MAMFAC algorithm is influenced by the designs of the actor and critic networks of both the CUs and the UAV. Specifically, assuming that the actor network of CU i comprises $L_{\text{ac},i}$ fully connected layers and the critic network contains $L_{\text{cr},i}$ fully connected layers, the number of neurons in the l-th layer of the actor network of CU i is denoted as $n_{\text{ac},i,l}$, while the number of neurons in the $l'$-th layer of the critic network of CU i is denoted as $n_{\text{cr},i,l'}$. Thus, the computational complexity for CU i at each time step is $\mathcal{O}\left(\sum_{l=0}^{L_{\text{ac},i}-1} n_{\text{ac},i,l}\, n_{\text{ac},i,l+1} + \sum_{l'=0}^{L_{\text{cr},i}-1} n_{\text{cr},i,l'}\, n_{\text{cr},i,l'+1}\right)$. Next, we analyze the computational complexity of the UAV network structure. Similarly, we assume that the UAV actor network comprises $L_{\text{ac},\text{UAV}}$ fully connected layers, while its critic network consists of $L_{\text{cr},\text{UAV}}$ fully connected layers. In the UAV actor network, the l-th layer contains $n_{\text{ac},\text{UAV},l}$ neurons, whereas the $l'$-th layer of the UAV critic network includes $n_{\text{cr},\text{UAV},l'}$ neurons. Thus, the computational complexity for the UAV at each time step is $\mathcal{O}\left(\sum_{l=0}^{L_{\text{ac},\text{UAV}}-1} n_{\text{ac},\text{UAV},l}\, n_{\text{ac},\text{UAV},l+1} + \sum_{l'=0}^{L_{\text{cr},\text{UAV}}-1} n_{\text{cr},\text{UAV},l'}\, n_{\text{cr},\text{UAV},l'+1}\right)$. This paper considers a set of agents comprising $I$ CUs and one UAV. If the process of training the neural network weights has a computational cost of $W$, then the total complexity of the SH-MAMFAC algorithm is $\mathcal{O}\left(W(IX + Y)\right)$, where $X = \sum_{l=0}^{L_{\text{ac},i}-1} n_{\text{ac},i,l}\, n_{\text{ac},i,l+1} + \sum_{l'=0}^{L_{\text{cr},i}-1} n_{\text{cr},i,l'}\, n_{\text{cr},i,l'+1}$ and $Y = \sum_{l=0}^{L_{\text{ac},\text{UAV}}-1} n_{\text{ac},\text{UAV},l}\, n_{\text{ac},\text{UAV},l+1} + \sum_{l'=0}^{L_{\text{cr},\text{UAV}}-1} n_{\text{cr},\text{UAV},l'}\, n_{\text{cr},\text{UAV},l'+1}$.
During the execution phase, the well-trained actor independently makes optimal decisions without relying on the actions or states of other agents, effectively reducing communication overhead. Each CU agent uses only the trained actor network and does not utilize the critic network. Thus, the execution complexity of a CU agent is $\mathcal{O}\left(\sum_{l=0}^{L_{\text{ac},i}-1} n_{\text{ac},i,l}\, n_{\text{ac},i,l+1}\right)$. Similarly, the execution complexity of the UAV agent is $\mathcal{O}\left(\sum_{l=0}^{L_{\text{ac},\text{UAV}}-1} n_{\text{ac},\text{UAV},l}\, n_{\text{ac},\text{UAV},l+1}\right)$.
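For intuition, the per-time-step terms above are simply sums of products of consecutive layer widths, as the toy count below shows; the layer widths are hypothetical and are not the network configuration in Table 3.

def mlp_mac_count(layer_sizes):
    """Multiply-accumulate count of a fully connected network with the given layer widths."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical widths: CU actor (obs -> 128 -> 128 -> F+1) and CU critic (inputs -> 256 -> 256 -> 1)
cu_per_step = mlp_mac_count([64, 128, 128, 31]) + mlp_mac_count([96, 256, 256, 1])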

4. Numerical Results

4.1. Simulation Settings

We simulate the proposed algorithm within a $200 \times 200\ \text{m}^2$ square area [7]; the system consists of an MBS, a UAV, and several GUs. The mobility of GUs is modeled using the reference point group mobility model in [25], with the number of CUs set to $I = 50$ and the number of RUs set to $J = 300$, unless otherwise specified. The height of the static MBS is $h_m = 10$ m, located at coordinates $[100, 100, 10]$. The initial flight height of the UAV is $H = 100$ m, with a departure direction randomly generated within the range $[0, \pi/2]$ and a maximum movement distance in one time slot of $d_{\max} = 10$ m. In each time slot t, each RU requests a file f from a set of $F = 30$ files with probability $p_{j,f}(t)$, where each file has a size of $S_f = 2$ Mbits [7]. The file popularity follows a Zipf distribution with a slope parameter of $\xi = 0.8$ [17]. Both the CUs and the UAV have limited cache capacities, capable of storing $S_i$ and $S_u$ Mbits, respectively. The cache capacity of the CUs is set to $S_i = 10$ Mbits, while $S_u$ varies from 2 Mbits to 20 Mbits. Other system simulation parameters are summarized in Table 3.
We conducted experimental simulations on a laptop equipped with 16.0 GB of RAM and an Intel Core i7-11370H @ 3.30 GHz processor. The software platform is Python 3.9 with PyTorch. ReLU is used as the activation function for each hidden layer, and the Adam optimizer is utilized to update the actor and critic networks. The parameters of the proposed SH-MAMFAC algorithm’s actor and critic networks can also be found in Table 3. Additionally, the training process of our proposed algorithm consists of $\hat{T} = 6000$ episodes, with each episode containing $N = 1000$ time slots.

4.2. Performance Evaluation

In this subsection, we assess the performance of the SH-MAMFAC algorithm by comparing it with the following benchmarks:
(1) Unscalable heterogeneous MADDPG (UH-MADDPG) [8]: All CUs and a UAV are considered as different types of agents, each with a decentralized actor and a centralized critic. The critic of each agent can access the actions of all other agents, but as the number of agents increases, the joint action space grows exponentially, leading to performance limitations in large-scale environments due to high computational and communication overhead.
(2) Unscalable centralized DRL (UC-DRL) [11]: A single CU is selected as the central agent, responsible for receiving collective observation information from all CUs and making decisions for the entire group. As the number of agents increases, the communication overhead for information gathering grows significantly. Additionally, communication delays in large-scale environments may result in outdated state information, affecting the timeliness and accuracy of decision-making.
(3) Scalable independent DRL (SI-DRL) [38]: Each agent makes decisions and updates its policy based solely on its own local observations, without considering the actions or observations of other agents. This method can scale to large agent environments but lacks mechanisms for collaboration and information sharing between agents. Additionally, it may face non-stationarity issues, and the convergence of the learning policy cannot be guaranteed.
This section evaluates and compares the training efficiency and computational performance of four DRL methods. Figure 4 illustrates the cumulative reward trends over 6000 episodes, where the solid lines represent the moving average over 100 episodes, and the shaded areas indicate the corresponding variance. The markers on the lines show the convergence within 6000 episodes. Additionally, we recorded and compared the computation time per episode, the number of episodes required to converge, and the total computation time. The results are summarized in Table 4.
As depicted in Figure 4, all DRL algorithms demonstrate similar low episode rewards during the initial training phase, attributed to the agents exploring the environment without optimized control strategies. As learning progresses, the agents accumulate valuable experience, which is utilized for network updates, leading to refined strategies and an increase in rewards across all DRL methods. Notably, the simplest method, SI-DRL, exhibits the highest variance and unstable learning behavior, resulting in the lowest cumulative rewards. This instability stems from agents training independently, relying solely on individual observations and actions, while disregarding the actions of others, which in turn leads to a non-stationary environment. In contrast, both UC-DRL and UH-MADDPG address the issue of non-stationarity by incorporating information from all agents, thereby enhancing training performance. However, these methods raise concerns regarding privacy, as they necessitate the sharing of local observations and actions from all agents. Furthermore, UC-DRL becomes infeasible in large-scale environments (e.g., when I = 50 ), as it suffers from the curse of dimensionality, resulting in an inability to complete training within a 4-h period, with the process halting at approximately 5500 episodes. SH-MAMFAC, leveraging a mean-field approximation, efficiently captures the average interaction effects among agents, thus optimizing multi-agent collaboration and significantly improving scalability, ultimately achieving the highest reward levels.
The computational performance of the various methods is compared as presented in Table 4, revealing that SI-DRL exhibits the shortest per-episode computation time. This is primarily due to its requirement to process only local observations and action inputs. In contrast, UC-DRL demonstrates the longest computation time, largely attributed to its higher computational overhead and latency issues. As previously mentioned, UC-DRL failed to complete the training within 4 h, and the total computation time shown in the table corresponds only to 5500 episodes. In comparison, UH-MADDPG reduces training time effectively by employing a centralized training with decentralized execution (CTDE) framework. During the execution phase, agents can make decisions independently, resulting in more efficient utilization of computational resources and further reducing training time. In simpler environments without coordination, SI-DRL achieves convergence the fastest [37]. The proposed SH-MAMFAC method, through efficient cooperation among agents, better adapts to the dynamic nature of communication environments and achieves convergence within 1600 episodes, taking only 0.61 h. Its performance surpasses both UH-MADDPG and UC-DRL.
Figure 5 compares the CHP of the proposed SH-MAMFAC algorithm with all baseline methods under different RU densities. At the initial stage, with lower RU density, the CHP of all schemes increases due to sufficient system cache space and bandwidth. However, when the number of RUs exceeds 300, cache and transmission resources become strained, leading to a sharp drop in CHP for most strategies, indicating that traditional methods can effectively serve up to around 300 RUs. Notably, SH-MAMFAC consistently achieves the highest CHP across all RU counts. For instance, at J = 300 , SH-MAMFAC’s CHP reached 91.36%, exceeding UH-MADDPG by 4.42%, UC-DRL by 15.84%, and SI-DRL by 38.99%. When J = 500 , SH-MAMFAC’s CHP decreased to 81.32%, with a decline of 10.04%, significantly lower than the declines of UH-MADDPG (24.84%) and UC-DRL (26.86%), and slightly lower than SI-DRL (11.62%). This can be attributed to MAMFAC’s multi-agent cooperative learning, which enables efficient resource allocation and cache utilization. In contrast, baseline methods suffer from more severe performance degradation at high RU densities. UH-MADDPG adopts a CTDE framework, which results in an exponentially increasing joint action space as the number of agents grows, leading to increased computational and communication complexity, thereby hindering effective learning and decision making. UC-DRL employs centralized training, with a single CU acting as the central agent, but as the network scales up, it struggles with managing large volumes of global information, causing increased system delay and reduced performance. In SI-DRL, agents train and make decisions independently, lacking resource coordination, leading to lower CHP. Furthermore, LRU (the least recently used (LRU) algorithm is a widely adopted cache eviction method that targets the item in the cache which has been accessed the least recently for removal) lacks an intelligent learning mechanism and relies solely on historical requests, making it unable to adapt to dynamic content and complex network environments, resulting in consistently low CHP. In summary, SH-MAMFAC effectively addresses the scalability issues inherent in traditional CTDE methods in large-scale multi-agent environments, making it suitable for real-world large-scale networks that require servicing a substantial number of GUs and handling high data requests, thus improving system CHP and QoS.
Figure 6 presents a comparison of transmission delay for various algorithms under different bandwidth settings. As the CU bandwidth increases, the transmission delay for all strategies shows a decreasing trend. This indicates that a higher bandwidth improves the data transmission rate, thereby reducing transmission delay. The increase in bandwidth allows more data to be transmitted per unit time, enhancing the STP of D2D communication. This enables RUs to more frequently obtain the required files directly from CUs via D2D communication without frequently accessing the UAV or MBS, thereby reducing transmission delay. SH-MAMFAC demonstrates the lowest transmission delay across all bandwidth settings, due to its efficient cooperation among numerous CU agents, allowing RUs to acquire content more effectively from CUs and ensuring stable and superior performance under varying bandwidth conditions. UH-MADDPG, due to its reliance on joint action space, fails to fully utilize the potential of increased bandwidth, resulting in a less significant reduction in transmission delay compared to SH-MAMFAC. UC-DRL depends on a central agent for decision making, and the high communication overhead and delay required for information collection limit its performance improvement with increasing bandwidth. Thus, UC-DRL faces challenges in achieving efficient real-time decision making in large-scale networks. SI-DRL, although having good scalability with independent decision making for each agent, lacks a cooperation mechanism, leading to poor delay performance. Additionally, LRU, due to its inability to adapt to network dynamics and bandwidth changes, consistently has the highest transmission delay.
In Figure 7, the optimized trajectory of the UAV is presented. To clearly illustrate the network layout and communication mechanism, only 5 CUs and 20 RUs are shown. The trajectory of a UAV is obtained using the proposed SH-MAMFAC algorithm. CUs are represented by blue circles, RUs by green crosses, and the UAV’s flight path is indicated by a red line, illustrating its trajectory over a service duration of T = 200 s. The D2D communication coverage area is depicted by cyan dashed circles. Some RUs are within the coverage of CUs and are primarily served by them. However, RUs outside the D2D coverage are served by either the UAV or the MBS, with those served by the UAV marked as green stars in the figure. This highlights the important role of UAVs in enhancing communication networks, particularly for RUs facing connectivity challenges. Additionally, due to limited battery capacity, the UAV must return after serving some RUs, resulting in only a subset of RUs being able to connect with it.
Figure 8 illustrates the transmission delay comparison of different algorithms under different UAV flight altitudes. The flight altitude of the UAV significantly affects the channel conditions and path loss, impacting communication performance. We observe that when the UAV is at a lower altitude, transmission delay increases with altitude. This is attributable to the reduced likelihood of LoS transmission and the greater average path loss associated with higher altitudes. As H continues to increase, the probability of LoS transmission improves and becomes more prevalent. During this phase, the average path loss decreases with increasing altitude, leading to a reduction in delay. When the probability of LoS transmission reaches a high level, for example, at H = 160 m, the effect of altitude on the likelihood of LoS and NLoS transmissions becomes minimal. Consequently, the average path loss rises with H, leading to an increase in delay. It can be observed that SH-MAMFAC dynamically adjusts its strategy to effectively adapt to changes in the channel environment caused by altitude variations, maintaining the lowest transmission delay across various flight altitudes. This makes it particularly suitable for complex network environments requiring dynamic altitude adjustments.
Figure 9 compares the CHPs of different algorithms under varying UAV cache sizes. The results show that as the cache size increases, the CHPs of all algorithms improve significantly. This improvement is mainly due to larger cache capacity allowing more popular files to be stored, increasing the probability of requested files being cached and thus boosting the CHP. Specifically, the increase in CHP is especially notable as the cache size grows from smaller values. For instance, SH-MAMFAC shows a 42.48% increase in CHP when the cache size expands from 2 Mbits to 6 Mbits. This is because each additional unit of cache allows the system to store more frequently requested files, significantly enhancing the CHP. However, as the cache size further expands to larger values, the additional space is often used for less frequently accessed files, slowing down the CHP improvement. SH-MAMFAC consistently achieves the highest CHP across all cache sizes, demonstrating its ability to optimize caching strategies more effectively and maintain good scalability and robustness in large-scale network environments.
To demonstrate the benefit of air–ground cooperation and joint optimization, we implement SH-MAMFAC variants without trajectory optimization and/or power control, denoted no-power SH-MAMFAC (NP-SH-MAMFAC), no-trajectory SH-MAMFAC (NT-SH-MAMFAC), and no-power, no-trajectory SH-MAMFAC (NPNT-SH-MAMFAC). When trajectory optimization is disabled, the UAV follows the circular trajectory defined in [39]. Figure 10 shows the transmission delay of the four schemes versus the maximum transmission power P_max. As P_max increases, the delay of all schemes decreases. SH-MAMFAC, which jointly optimizes the UAV trajectory and the power control, achieves the lowest delay. NP-SH-MAMFAC optimizes only the trajectory, whereas NT-SH-MAMFAC optimizes only the power control. At low transmission power (P_max < 1.7 W), NP-SH-MAMFAC clearly outperforms NT-SH-MAMFAC in terms of delay; when P_max > 1.7 W, NT-SH-MAMFAC performs better because power control enables further delay reduction at higher power levels. NPNT-SH-MAMFAC, lacking both optimizations, exhibits the highest delay at every power setting, confirming the importance of joint optimization for performance improvement. These results indicate that trajectory optimization is more critical at low transmission power, since a well-planned UAV path covers user areas effectively, shortens the transmission distance, and reduces path loss and delay. At higher transmission power, power control becomes more important, as dynamically adjusting the transmission power improves efficiency, mitigates interference, and substantially lowers delay. By optimizing both jointly, SH-MAMFAC exploits their synergy and maintains the best communication performance under all power conditions.

5. Conclusions

This paper addressed the joint power allocation, trajectory design, and caching placement problem in UAV-assisted D2D caching networks. We formulated a mixed-integer programming problem that maximizes the total CHP subject to user-demand and resource constraints. To tackle the scalability issues of large-scale UAV-assisted caching systems, we introduced the SH-MAMFAC algorithm, which employs mean-field MADRL to model the interaction dynamics of a heterogeneous multi-agent system comprising homogeneous CUs and a UAV. In this framework, each agent makes autonomous decisions based on its local observations after centralized training. The simulation results demonstrated that our algorithm significantly reduces the training time of large-scale multi-agent systems and offers strong deployability. Compared with traditional DRL algorithms, our approach effectively improves the CHP and reduces the transmission delay. Furthermore, SH-MAMFAC outperforms the NP-SH-MAMFAC, NT-SH-MAMFAC, and NPNT-SH-MAMFAC variants in terms of delay, highlighting the necessity of joint optimization for enhancing network performance.
While SH-MAMFAC performs well in handling environmental dynamics, improving GU service satisfaction, and easing deployment in real heterogeneous networks, there is still room for improvement. In particular, a UAV-D2D network must continuously adapt and dynamically reconfigure itself, which poses significant challenges for centralized training in environments that require real-time learning and adaptation. Addressing these challenges calls for mechanisms that improve distributed decision making, system scalability, and overall efficiency. Enabling each CU or UAV agent to act autonomously on local observations and interactions, rather than relying on centralized control, would greatly enhance the system's flexibility and adaptability, reduce the communication overhead and computational burden of centralized control, and allow the network to scale effectively as the number of agents increases. We will explore these directions in future work.

Author Contributions

Conceptualization, H.W. and T.L.; methodology, H.W. and T.L.; software, H.W. and X.W.; validation, X.W.; formal analysis, H.W.; investigation, R.W.; resources, R.W.; data curation, T.L.; writing—original draft preparation, H.W.; writing—review and editing, H.W. and S.X.; visualization, H.W.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L. All authors have reviewed and consented to the final version of the manuscript.

Funding

This research received funding from the National Natural Science Foundation of China under Grant No. 62271395.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors have no conflicts of interest to disclose.

Appendix A. Summary of Acronyms

Table A1. Summary of acronyms.
Acronym | Description
AR | Augmented reality
BaB | Branch and bound
BS | Base station
CDF | Cumulative distribution function
CHP | Cache hit probability
CU/RU | Caching user/requesting user
CTDE | Centralized training with decentralized execution
D2D | Device to device
DRL | Deep reinforcement learning
LoS/NLoS | Line of sight/non-line of sight
MADDPG | Multi-agent deep deterministic policy gradient
MADRL | Multi-agent deep reinforcement learning
MIMO | Multiple input multiple output
MBS | Macro base station
NP-SH-MAMFAC | No-power scalable heterogeneous multi-agent mean-field actor–critic
NPNT-SH-MAMFAC | No-power and no-trajectory scalable heterogeneous multi-agent mean-field actor–critic
NT-SH-MAMFAC | No-trajectory scalable heterogeneous multi-agent mean-field actor–critic
PPO | Proximal policy optimization
SH-MAMFAC | Scalable heterogeneous multi-agent mean-field actor–critic
SI-DRL | Scalable independent deep reinforcement learning
STP | Successful transmission probability
TDMA/NOMA | Time-division multiple access/non-orthogonal multiple access
UH-MADDPG | Unscalable heterogeneous multi-agent deep deterministic policy gradient
UC-DRL | Unscalable centralized deep reinforcement learning

Appendix B. Proof for Lemma 1

The probability density functions of the random variables $x$ and $y$ can be expressed as $f_x(x) = \frac{1}{\varsigma_1} \exp\left(-\frac{x}{\varsigma_1}\right) U(x)$ and $f_y(y) = \frac{1}{\varsigma_2} \exp\left(-\frac{y-1}{\varsigma_2}\right) U(y-1)$. Subsequently, the probability density function of the random variable $z = x/y$ can be calculated as follows:
$$
f_z(z) = \int_{1}^{\infty} y \, f_{xy}(yz, y) \, \mathrm{d}y
= \left[ \frac{1}{\varsigma_1 + \varsigma_2 z} + \frac{\varsigma_1 \varsigma_2}{\left(\varsigma_1 + \varsigma_2 z\right)^2} \right] \exp\left(-\frac{z}{\varsigma_1}\right). \tag{A1}
$$
According to (A1), the CDF can be expressed as
$$
F_z(z) = \int_{0}^{z} \left[ \frac{e^{-\frac{x}{\varsigma_1}}}{\varsigma_1 + \varsigma_2 x} + \frac{\varsigma_1 \varsigma_2 \, e^{-\frac{x}{\varsigma_1}}}{\left(\varsigma_1 + \varsigma_2 x\right)^2} \right] \mathrm{d}x. \tag{A2}
$$
Utilizing the method of integration by parts, the second term on the right-hand side of (A2) can be computed as
$$
\int_{0}^{z} \frac{\varsigma_1 \varsigma_2 \, e^{-\frac{x}{\varsigma_1}}}{\left(\varsigma_1 + \varsigma_2 x\right)^2} \mathrm{d}x
= \left[ -\frac{\varsigma_1 e^{-\frac{x}{\varsigma_1}}}{\varsigma_1 + \varsigma_2 x} \right]_{0}^{z} - \int_{0}^{z} \frac{e^{-\frac{x}{\varsigma_1}}}{\varsigma_1 + \varsigma_2 x} \mathrm{d}x
= 1 - \frac{\varsigma_1 e^{-\frac{z}{\varsigma_1}}}{\varsigma_1 + \varsigma_2 z} - \int_{0}^{z} \frac{e^{-\frac{x}{\varsigma_1}}}{\varsigma_1 + \varsigma_2 x} \mathrm{d}x. \tag{A3}
$$
By substituting (A3) into (A2), the desired result (5) can be obtained.
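As a numerical sanity check, the closed form $F_z(z) = 1 - \varsigma_1 e^{-z/\varsigma_1}/(\varsigma_1 + \varsigma_2 z)$ obtained by combining (A2) and (A3) can be compared against a Monte Carlo estimate. The sketch below samples $x$ and $y$ from the densities written above (an exponential and a unit-shifted exponential); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
s1, s2 = 1.5, 0.7      # arbitrary values for varsigma_1 and varsigma_2
n = 1_000_000

x = rng.exponential(scale=s1, size=n)          # f_x(x) = (1/s1) exp(-x/s1) U(x)
y = 1.0 + rng.exponential(scale=s2, size=n)    # f_y(y) = (1/s2) exp(-(y-1)/s2) U(y-1)
z = x / y

def cdf_closed_form(v: float) -> float:
    """Closed-form CDF of z = x / y obtained from (A2) and (A3)."""
    return 1.0 - s1 * np.exp(-v / s1) / (s1 + s2 * v)

for v in (0.2, 0.5, 1.0, 2.0, 5.0):
    print(f"z = {v:3.1f}  empirical = {np.mean(z <= v):.4f}  "
          f"closed form = {cdf_closed_form(v):.4f}")
```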

Appendix C. Derivation of D2D Successful Transmission Probability

The CDF of $r_{i,j}^{t}$ is provided by (6). By utilizing conditional probability, the STP can be obtained as follows:
$$
\begin{aligned}
\Pr\nolimits_{i,j}^{t} &= \Pr\!\left( r_{i,j}^{t} \ge \frac{S_f}{T_{i,j}} \right) \\
&= \Pr\!\left( r_{i,j}^{t} \ge \frac{S_f}{T_{i,j}},\, T_{i,j} < 0 \right) + \Pr\!\left( r_{i,j}^{t} \ge \frac{S_f}{T_{i,j}},\, T_{i,j} > 0 \right) \\
&= 0 + \int_{0}^{+\infty} \left[ 1 - F_{r_{i,j}^{t}}\!\left( \frac{S_f}{t} \right) \right] \frac{1}{\mu_{i,j}} e^{-\frac{t}{\mu_{i,j}}} \, \mathrm{d}t \\
&= \int_{0}^{+\infty} \frac{\varsigma_1}{\mu_{i,j}\left[ \varsigma_1 + \varsigma_2 \left( 2^{\frac{S_f}{t B_d}} - 1 \right) \right]}
\exp\!\left[ -\frac{2^{\frac{S_f}{t B_d}} - 1}{\varsigma_1} - \frac{t}{\mu_{i,j}} \right] \mathrm{d}t.
\end{aligned} \tag{A4}
$$
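The integral in (A4) has no elementary closed form, but it is one-dimensional and straightforward to evaluate numerically. A minimal sketch using scipy.integrate.quad is given below; all parameter values are placeholders rather than the settings of Table 3.

```python
import numpy as np
from scipy.integrate import quad

def d2d_stp(S_f: float, B_d: float, s1: float, s2: float, mu: float) -> float:
    """Numerically evaluate the D2D successful transmission probability (A4)."""
    def integrand(t: float) -> float:
        expo = S_f / (t * B_d)
        if expo > 700.0:            # 2**expo would overflow; the integrand is ~0 here
            return 0.0
        thr = 2.0 ** expo - 1.0     # SINR threshold for delivering S_f within time t
        return (s1 / (mu * (s1 + s2 * thr))) * np.exp(-thr / s1 - t / mu)
    value, _ = quad(integrand, 0.0, np.inf, limit=200)
    return value

# Placeholder numbers: a 2 Mbit file over a 1 MHz D2D link, arbitrary channel/contact statistics.
print(f"STP = {d2d_stp(S_f=2e6, B_d=1e6, s1=1.5, s2=0.7, mu=3.0):.4f}")
```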

Appendix D. Proof for Theorem 1

We illustrate the validity of the Q-function approximation using the CU agent as an example. Mean-field theory assumes that all agents share the same action space and defines $\bar{a}_{1,i} = \frac{1}{K_{k_1}} \sum_{k_1 \in \mathcal{K}_{k_1}} a_{1,k_1}$ as the mean action over the neighborhood $\mathcal{K}_{k_1}$ of agent $i$, where $K_{k_1} = \left| \mathcal{K}_{k_1} \right|$. The one-hot action of its neighbor $k_1$ is represented as the sum of the mean action $\bar{a}_{1,i}$ and a small fluctuation $\delta a_{1,k_1}$, i.e., $a_{1,k_1} = \bar{a}_{1,i} + \delta a_{1,k_1}$.
According to Taylor’s theorem, the pairwise Q-function $Q_i\!\left(o, a_i, a_{1,k_1}\right)$ can be expressed as
$$
\begin{aligned}
Q_i(o, a) &= \frac{1}{K_{k_1}} \sum_{k_1 \in \mathcal{K}_{k_1}} Q_i\!\left(o, a_i, a_{1,k_1}\right) \\
&= \frac{1}{K_{k_1}} \sum_{k_1} \Big[ Q_i\!\left(o, a_i, \bar{a}_{1,i}\right)
+ \nabla_{\bar{a}_{1,i}} Q_i\!\left(o, a_i, \bar{a}_{1,i}\right) \cdot \delta a_{1,k_1}
+ \tfrac{1}{2}\, \delta a_{1,k_1} \cdot \nabla^2_{\tilde{a}_{1,k_1}} Q_i\!\left(o, a_i, \tilde{a}_{1,k_1}\right) \cdot \delta a_{1,k_1} \Big] \\
&= Q_i\!\left(o, a_i, \bar{a}_{1,i}\right)
+ \nabla_{\bar{a}_{1,i}} Q_i\!\left(o, a_i, \bar{a}_{1,i}\right) \cdot \underbrace{\frac{1}{K_{k_1}} \sum_{k_1} \delta a_{1,k_1}}_{\text{first-order term}}
+ \frac{1}{2 K_{k_1}} \sum_{k_1} \underbrace{\delta a_{1,k_1} \cdot \nabla^2_{\tilde{a}_{1,k_1}} Q_i\!\left(o, a_i, \tilde{a}_{1,k_1}\right) \cdot \delta a_{1,k_1}}_{\text{second-order term}} \\
&= Q_i\!\left(o, a_i, \bar{a}_{1,i}\right) + \frac{1}{2 K_{k_1}} \sum_{k_1} G_i\!\left(a_{1,k_1}\right)
\;\approx\; Q_i\!\left(o, a_i, \bar{a}_{1,i}\right),
\end{aligned} \tag{A5}
$$
where $G_i\!\left(a_{1,k_1}\right) = \delta a_{1,k_1} \cdot \nabla^2_{\tilde{a}_{1,k_1}} Q_i\!\left(o, a_i, \tilde{a}_{1,k_1}\right) \cdot \delta a_{1,k_1}$ is the remainder of the Taylor polynomial, in which $\tilde{a}_{1,k_1} = \bar{a}_{1,i} + \epsilon_{1,k_1} \delta a_{1,k_1}$ and $\epsilon_{1,k_1} \in [0, 1]$. Since $\frac{1}{K_{k_1}} \sum_{k_1} \delta a_{1,k_1} = 0$, the first-order term is eliminated. It can further be shown that, under the M-smooth condition (e.g., for linear functions) on $Q_i(o, a)$, the remainder $G_i\!\left(a_{1,k_1}\right)$ is a small fluctuation that approaches zero [21,37]. Assuming that all agents are uniform and localized within their neighborhoods, the remainders tend to cancel each other out, which yields the approximation $Q_i\!\left(o, a_i, \bar{a}_{1,i}\right)$. Combining the two types of agents, we obtain the following approximation:
$$
Q_k(o, a) = \frac{1}{K_k} \sum_{k} Q_k\!\left(o, a_k, a_{1,k_1}, a_{2,k_2}\right) \approx Q_k\!\left(o, a_k, \bar{a}_{1,k}, \bar{a}_{2,k}\right). \tag{A6}
$$
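The quality of this approximation is easy to probe numerically: for a linear Q-function the first- and second-order terms vanish exactly, while for a Q-function with small curvature the gap between the averaged pairwise values and the mean-action value stays on the order of that curvature. The sketch below uses a hypothetical quadratic Q over one-hot neighbor actions; it is an illustration of the Taylor argument, not the trained critic of SH-MAMFAC.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 50, 8                                     # number of neighbors, one-hot dimension

actions = np.eye(d)[rng.integers(0, d, size=K)]  # one-hot neighbor actions a_{1,k}
a_bar = actions.mean(axis=0)                     # mean action \bar{a}_{1,i}

w = rng.normal(size=d)               # linear part of a hypothetical Q(o, a_i, .)
A = 0.05 * rng.normal(size=(d, d))   # small curvature (stands in for the Taylor remainder)

def Q(a: np.ndarray) -> float:
    """Hypothetical pairwise Q with o and a_i held fixed."""
    return float(w @ a + a @ A @ a)

pairwise_avg = np.mean([Q(a) for a in actions])  # (1/K) sum_k Q(o, a_i, a_{1,k})
mean_field = Q(a_bar)                            # Q(o, a_i, \bar{a}_{1,i})
print(f"average pairwise Q = {pairwise_avg:.4f}")
print(f"mean-field Q       = {mean_field:.4f}")
print(f"gap                = {abs(pairwise_avg - mean_field):.4f}")
```

Setting the curvature matrix A to zero makes the two quantities coincide exactly, which is the linear case mentioned above; increasing its scale widens the gap in proportion to the remainder term in (A5).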

Appendix E. Proof for Theorem 2

Lemma A1.
Under Assumption 3, the Nash operator $\mathcal{H}^{\mathrm{Nash}}$ constitutes a contraction mapping in the complete metric space from $\mathcal{Q}$ to $\mathcal{Q}$, and its fixed point is the Nash Q-value of the entire game, i.e., $\mathcal{H}^{\mathrm{Nash}} \mathbf{Q}^{*} = \mathbf{Q}^{*}$.
Proof. 
For a detailed proof, see Theorem 17 in [40]. □
Lemma A2.
The random process $\Delta_t$ defined in $\mathbb{R}$ as
$$
\Delta_{t+1}(x) = \left(1 - \alpha_t(x)\right) \Delta_t(x) + \alpha_t(x) F_t(x) \tag{A7}
$$
converges to zero with probability 1 (w.p.1) when
1. $0 \le \alpha_t(x) \le 1$, $\sum_t \alpha_t(x) = \infty$, and $\sum_t \alpha_t^2(x) < \infty$;
2. $x \in \mathcal{X}$ and $|\mathcal{X}| < \infty$, where $\mathcal{X}$ is the set of possible states;
3. $\left\| \mathbb{E}\left[ F_t(x) \mid \mathcal{F}_t \right] \right\|_W \le \gamma \left\| \Delta_t \right\|_W + c_t$, where $\gamma \in (0, 1)$ and $c_t$ converges to zero w.p.1;
4. $\mathrm{var}\left[ F_t(x) \mid \mathcal{F}_t \right] \le K \left( 1 + \left\| \Delta_t \right\|_W^2 \right)$ with constant $K > 0$;
where $\mathcal{F}_t$ denotes the filtration of an increasing sequence of $\sigma$-fields that encompasses the history of the processes, $\alpha_t, \Delta_t, F_t \in \mathcal{F}_t$, and $\| \cdot \|_W$ denotes a weighted maximum norm.
Proof. 
This lemma is derived from Theorem 1 in [41]. □
The Nash operator $\mathcal{H}^{\mathrm{Nash}}$ is defined as follows:
$$
\mathcal{H}^{\mathrm{Nash}} \mathbf{Q} = \mathbb{E}_{o^{\prime} \sim P}\left[ \mathbf{r}(o, a) + \gamma \mathbf{v}^{\mathrm{Nash}}\!\left(o^{\prime}\right) \right],
$$
where $\mathbf{Q} \triangleq \left[ Q_1, \ldots, Q_{I+1} \right]$ and $\mathbf{r}(o, a) \triangleq \left[ r_1(o, a), \ldots, r_{I+1}(o, a) \right]$. The Nash value function is defined as $\mathbf{v}^{\mathrm{Nash}}(o) \triangleq \left[ v_{1, \pi^{*}}(o), \ldots, v_{I+1, \pi^{*}}(o) \right]$. Here, the Nash policy is denoted by $\pi^{*}$, and the Nash value function is computed under the assumption that all agents adhere to $\pi^{*}$ from the initial state $s$.
To prove Theorem 2, we need to apply Lemma A2. By subtracting $Q_{*}(o, a)$ from both sides of (32) and relating it to (A7), we define
$$
\Delta_t(x) = Q_{k,t}\!\left(o, a_k, \bar{a}_{l,k}\right) - Q_{*}(o, a), \tag{A12}
$$
$$
F_t(x) = r_t + \gamma v_t^{\mathrm{MF}}\!\left(o_{t+1}\right) - Q_{*}\!\left(o_t, a_t\right), \tag{A13}
$$
where $x \triangleq \left(o_t, a_t\right)$ represents the state–action pair visited at time $t$.
The objective is to demonstrate that the four conditions of Lemma A2 are satisfied, so that Lemma A2 implies the convergence of $\Delta_t(x)$ to zero. If this is the case, the sequence of Q-values asymptotically approaches the Nash Q-value $Q_{*}$. In (A7), $\alpha_t$ denotes the learning rate, which automatically satisfies the first condition of Lemma A2. The second condition also holds, since we consider finite state and action spaces.
Let $\mathcal{F}_t$ be the $\sigma$-field generated by all random variables in the history up to time $t$, which includes $o_t, \alpha_t, a_t, r_{t-1}, \ldots, o_1, \alpha_1, a_1, Q_0$. Consequently, $Q_t$ is a random variable determined by the historical trajectory up to time $t$.
To demonstrate the third condition of Lemma A2, we refer to (A13):
$$
\begin{aligned}
F_t\!\left(o_t, a_t\right) &= r_t + \gamma v_t^{\mathrm{MF}}\!\left(o_{t+1}\right) - Q_{*}\!\left(o_t, a_t\right) \\
&= r_t + \gamma v_t^{\mathrm{Nash}}\!\left(o_{t+1}\right) - Q_{*}\!\left(o_t, a_t\right) + \gamma \left[ v_t^{\mathrm{MF}}\!\left(o_{t+1}\right) - v_t^{\mathrm{Nash}}\!\left(o_{t+1}\right) \right] \\
&= \left[ r_t + \gamma v_t^{\mathrm{Nash}}\!\left(o_{t+1}\right) - Q_{*}\!\left(o_t, a_t\right) \right] + C_t\!\left(o_t, a_t\right) \\
&= F_t^{\mathrm{Nash}}\!\left(o_t, a_t\right) + C_t\!\left(o_t, a_t\right).
\end{aligned} \tag{A14}
$$
According to Lemma A1, $F_t^{\mathrm{Nash}}$ establishes a contraction mapping with the norm $\|\cdot\|$ defined as the maximum norm on $a$. Thus, we have for all $t$ that
$$
\left\| \mathbb{E}\left[ F_t^{\mathrm{Nash}}\!\left(o_t, a_t\right) \mid \mathcal{F}_t \right] \right\| \le \gamma \left\| Q_t - Q_{*} \right\| = \gamma \left\| \Delta_t \right\|. \tag{A15}
$$
Now, we substitute (A15) into (A14):
$$
\left\| \mathbb{E}\left[ F_t\!\left(o_t, a_t\right) \mid \mathcal{F}_t \right] \right\|
\le \left\| \mathbb{E}\left[ F_t^{\mathrm{Nash}}\!\left(o_t, a_t\right) \mid \mathcal{F}_t \right] \right\| + \left\| \mathbb{E}\left[ C_t\!\left(o_t, a_t\right) \mid \mathcal{F}_t \right] \right\|
\le \gamma \left\| \Delta_t \right\| + \left\| \mathbb{E}\left[ C_t\!\left(o_t, a_t\right) \mid \mathcal{F}_t \right] \right\|. \tag{A16}
$$
Since we are using the maximum norm, the last two terms on the right-hand side of (A16) are both non-negative and finite. We can demonstrate that the term $C_t\!\left(o_t, a_t\right)$ converges to zero w.p.1; this proof relies on Assumption 3 (see Theorem 1 in [21]). We utilize this fact in the last term of (A16). Therefore, the third condition of Lemma A2 is established.
For the fourth condition, we leverage the earlier conclusion that $\mathcal{H}^{\mathrm{MF}}$ establishes a contraction mapping, i.e., $\mathcal{H}^{\mathrm{MF}} \mathbf{Q}^{*} = \mathbf{Q}^{*}$. It follows that
$$
\begin{aligned}
\mathrm{var}\left[ F_t\!\left(o_t, a_t\right) \mid \mathcal{F}_t \right]
&= \mathbb{E}\left[ \left( r_t + \gamma v_t^{\mathrm{MF}}\!\left(o_{t+1}\right) - Q_{*}\!\left(o_t, a_t\right) \right)^2 \right] \\
&= \mathbb{E}\left[ \left( r_t + \gamma v_t^{\mathrm{MF}}\!\left(o_{t+1}\right) - \mathcal{H}^{\mathrm{MF}}\!\left(\mathbf{Q}^{*}\right) \right)^2 \right] \\
&= \mathrm{var}\left[ r_t + \gamma v_t^{\mathrm{MF}}\!\left(o_{t+1}\right) \mid \mathcal{F}_t \right]
\le K \left( 1 + \left\| \Delta_t \right\|_W^2 \right). 
\end{aligned} \tag{A17}
$$
In the final step of (A17), we utilize Assumption 1, which states that the reward $r_t$ is always bounded by a constant. Thus, with all conditions satisfied, it follows from Lemma A2 that $\Delta_t$ converges to zero w.p.1, meaning that $Q_t$ converges to $Q_{*}$ w.p.1.
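The mechanics of Lemma A2 can also be visualized with a toy instance of the iteration (A7): choose an $F_t$ whose conditional mean satisfies the contraction condition 3 with a vanishing bias $c_t$, add bounded-variance noise for condition 4, and use step sizes $\alpha_t = 1/t$ for condition 1. The sketch below is only a numerical illustration of the lemma, not part of the proof; all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n_states, T = 0.5, 5, 20000    # contraction factor, finite state set, iterations

delta = rng.normal(size=n_states)     # Delta_0
history = []
for t in range(1, T + 1):
    alpha = 1.0 / t                               # sum(alpha) = inf, sum(alpha^2) < inf
    c_t = 1.0 / np.sqrt(t)                        # vanishing bias term, c_t -> 0
    noise = rng.normal(scale=0.3, size=n_states)  # zero-mean, bounded-variance noise
    F = gamma * delta + c_t + noise               # ||E[F | history]|| <= gamma*||Delta|| + c_t
    delta = (1.0 - alpha) * delta + alpha * F     # the iteration (A7)
    history.append(float(np.max(np.abs(delta))))

print("max |Delta_t| at t = 100, 1000, 20000:",
      [round(history[i - 1], 3) for i in (100, 1000, 20000)])
```

The printed maxima shrink toward zero as $t$ grows, consistent with the w.p.1 convergence guaranteed by the lemma.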

References

  1. Zeng, Y.; Zhang, R.; Lim, T.J. Wireless communications with unmanned aerial vehicles: Opportunities and challenges. IEEE Commun. Mag. 2016, 54, 36–42. [Google Scholar] [CrossRef]
  2. Zhang, T.; Wang, Y.; Liu, Y.; Xu, W.; Nallanathan, A. Cache-enabling UAV communications: Network deployment and resource allocation. IEEE Trans. Wirel. Commun. 2020, 19, 7470–7483. [Google Scholar] [CrossRef]
  3. Lyu, J.; Zeng, Y.; Zhang, R. UAV-aided offloading for cellular hotspot. IEEE Trans. Wirel. Commun. 2018, 17, 3988–4001. [Google Scholar] [CrossRef]
  4. Li, Y.; Xia, M. Ground-to-air communications beyond 5G: A coordinated multipoint transmission based on Poisson-Delaunay triangulation. IEEE Trans. Wirel. Commun. 2022, 22, 1841–1854. [Google Scholar] [CrossRef]
  5. Zhao, N.; Liu, X.; Yu, F.R.; Li, M.; Leung, V.C. Communications, caching, and computing oriented small cell networks with interference alignment. IEEE Commun. Mag. 2016, 54, 29–35. [Google Scholar] [CrossRef]
  6. Liu, D.; Chen, B.; Yang, C.; Molisch, A.F. Caching at the wireless edge: Design aspects, challenges, and future directions. IEEE Commun. Mag. 2016, 54, 22–28. [Google Scholar] [CrossRef]
  7. Zhang, T.; Wang, Y.; Yi, W.; Liu, Y.; Nallanathan, A. Joint optimization of caching placement and trajectory for UAV-D2D networks. IEEE Trans. Commun. 2022, 70, 5514–5527. [Google Scholar] [CrossRef]
  8. Ding, R.; Xu, Y.; Gao, F.; Shen, X. Trajectory design and access control for air–ground coordinated communications system with multiagent deep reinforcement learning. IEEE Internet Things J. 2021, 9, 5785–5798. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Mou, Z.; Gao, F.; Jiang, J.; Ding, R.; Han, Z. UAV-enabled secure communications by multi-agent deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 11599–11611. [Google Scholar] [CrossRef]
  10. Chang, Z.; Deng, H.; You, L.; Min, G.; Garg, S.; Kaddoum, G. Trajectory design and resource allocation for multi-UAV networks: Deep reinforcement learning approaches. IEEE Trans. Netw. Sci. Eng. 2022, 10, 2940–2951. [Google Scholar] [CrossRef]
  11. Yu, J.; Li, Y.; Liu, X.; Sun, B.; Wu, Y.; Tsang, D.H.K. IRS assisted NOMA aided mobile edge computing with queue stability: Heterogeneous multi-agent reinforcement learning. IEEE Trans. Wirel. Commun. 2022, 22, 4296–4312. [Google Scholar] [CrossRef]
  12. Li, Y.; Aghvami, A.H.; Dong, D. Path Planning for Cellular-Connected UAV: A DRL Solution With Quantum-Inspired Experience Replay. IEEE Trans. Wirel. Commun. 2022, 21, 7897–7912. [Google Scholar] [CrossRef]
  13. Chen, Y.J.; Liao, K.M.; Ku, M.L.; Tso, F.P.; Chen, G.Y. Multi-Agent Reinforcement Learning Based 3D Trajectory Design in Aerial-Terrestrial Wireless Caching Networks. IEEE Trans. Veh. Technol. 2021, 70, 8201–8215. [Google Scholar] [CrossRef]
  14. Zhang, T.; Wang, Z.; Liu, Y.; Xu, W.; Nallanathan, A. Joint resource, deployment, and caching optimization for AR applications in dynamic UAV NOMA networks. IEEE Trans. Wirel. Commun. 2021, 21, 3409–3422. [Google Scholar] [CrossRef]
  15. Al-Hilo, A.; Samir, M.; Assi, C.; Sharafeddine, S.; Ebrahimi, D. UAV-assisted content delivery in intelligent transportation systems-joint trajectory planning and cache management. IEEE Trans. Intell. Transp. Syst. 2020, 22, 5155–5167. [Google Scholar] [CrossRef]
  16. Qin, P.; Fu, Y.; Zhang, J.; Geng, S.; Liu, J.; Zhao, X. DRL-Based Resource Allocation and Trajectory Planning for NOMA-Enabled Multi-UAV Collaborative Caching 6 G Network. IEEE Trans. Veh. Technol. 2024, 73, 8750–8764. [Google Scholar] [CrossRef]
  17. Ji, J.; Zhu, K.; Cai, L. Trajectory and communication design for cache-enabled UAVs in cellular networks: A deep reinforcement learning approach. IEEE Trans. Mob. Comput. 2022, 22, 6190–6204. [Google Scholar] [CrossRef]
  18. Peng, H.; Shen, X. Multi-agent reinforcement learning based resource management in MEC-and UAV-assisted vehicular networks. IEEE J. Sel. Areas Commun. 2020, 39, 131–141. [Google Scholar] [CrossRef]
  19. Araf, S.; Saha, A.S.; Kazi, S.H.; Tran, N.H.; Alam, M.G.R. UAV Assisted Cooperative Caching on Network Edge Using Multi-Agent Actor-Critic Reinforcement Learning. IEEE Trans. Veh. Technol. 2023, 72, 2322–2337. [Google Scholar] [CrossRef]
  20. Rodrigues, T.K.; Suto, K.; Nishiyama, H.; Liu, J.; Kato, N. Machine learning meets computation and communication control in evolving edge and cloud: Challenges and future perspective. IEEE Commun. Surv. Tutor. 2019, 22, 38–67. [Google Scholar] [CrossRef]
  21. Yang, Y.; Luo, R.; Li, M.; Zhou, M.; Zhang, W.; Wang, J. Mean field multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5571–5580. [Google Scholar] [CrossRef]
  22. Baştuğ, E.; Bennis, M.; Zeydan, E.; Kader, M.A.; Karatepe, I.A.; Er, A.S.; Debbah, M. Big data meets telcos: A proactive caching perspective. J. Commun. Netw. 2015, 17, 549–557. [Google Scholar] [CrossRef]
  23. Wu, H.; Lyu, F.; Zhou, C.; Chen, J.; Wang, L.; Shen, X. Optimal UAV caching and trajectory in aerial-assisted vehicular networks: A learning-based approach. IEEE J. Sel. Areas Commun. 2020, 38, 2783–2797. [Google Scholar] [CrossRef]
  24. Ji, J.; Zhu, K.; Niyato, D.; Wang, R. Joint cache placement, flight trajectory, and transmission power optimization for multi-UAV assisted wireless networks. IEEE Trans. Wirel. Commun. 2020, 19, 5389–5403. [Google Scholar] [CrossRef]
  25. Wang, J.; Zhou, X.; Zhang, H.; Yuan, D. Joint trajectory design and power allocation for uav assisted network with user mobility. IEEE Trans. Veh. Technol. 2023, 72, 13173–13189. [Google Scholar] [CrossRef]
  26. Ji, J.; Zhu, K.; Niyato, D.; Wang, R. Joint trajectory design and resource allocation for secure transmission in cache-enabled UAV-relaying networks with D2D communications. IEEE Internet Things J. 2020, 8, 1557–1571. [Google Scholar] [CrossRef]
  27. Mozaffari, M.; Saad, W.; Bennis, M.; Debbah, M. Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs. IEEE Trans. Wirel. Commun. 2016, 15, 3949–3963. [Google Scholar] [CrossRef]
  28. Wang, L.; Tang, H.; Čierny, M. Device-to-device link admission policy based on social interaction information. IEEE Trans. Veh. Technol. 2014, 64, 4180–4186. [Google Scholar] [CrossRef]
  29. Zhang, Y.; Pan, E.; Song, L.; Saad, W.; Dawy, Z.; Han, Z. Social network enhanced device-to-device communication underlaying cellular networks. In Proceedings of the 2013 IEEE/CIC International Conference on Communications in China-Workshops (CIC/ICCC), Xi’an, China, 12–14 August 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 182–186. [Google Scholar] [CrossRef]
  30. Feng, Q.; McGeehan, J.; Tameh, E.K.; Nix, A.R. Path loss models for air-to-ground radio channels in urban environments. In Proceedings of the 2006 IEEE 63rd Vehicular Technology Conference, Melbourne, Australia, 7–10 May 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 6, pp. 2901–2905. [Google Scholar] [CrossRef]
  31. Gao, X.; Wang, X.; Qian, Z. Probabilistic Caching Strategy and TinyML-Based Trajectory Planning in UAV-Assisted Cellular IoT System. IEEE Internet Things J. 2024, 11, 21227–21238. [Google Scholar] [CrossRef]
  32. Zeng, Y.; Xu, J.; Zhang, R. Energy minimization for wireless communication with rotary-wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345. [Google Scholar] [CrossRef]
  33. Kangasharju, J.; Roberts, J.; Ross, K.W. Object replication strategies in content distribution networks. Comput. Commun. 2002, 25, 376–383. [Google Scholar] [CrossRef]
  34. Liang, L.; Ye, H.; Li, G.Y. Spectrum sharing in vehicular networks based on multi-agent reinforcement learning. IEEE J. Sel. Areas Commun. 2019, 37, 2282–2292. [Google Scholar] [CrossRef]
  35. Lu, Y.; Zhang, M.; Tang, M. Caching for Edge Inference at Scale: A Mean Field Multi-Agent Reinforcement Learning Approach. In Proceedings of the GLOBECOM 2023-2023 IEEE Global Communications Conference, Kuala Lumpur, Malaysia, 8–12 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 332–337. [Google Scholar] [CrossRef]
  36. Zhang, H.; Tang, H.; Hu, Y.; Wei, X.; Wu, C.; Ding, W.; Zhang, X.P. Heterogeneous mean-field multi-agent reinforcement learning for communication routing selection in SAGI-net. In Proceedings of the 2022 IEEE 96th Vehicular Technology Conference (VTC2022-Fall), London, UK, 26–29 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
  37. Qiu, D.; Wang, J.; Dong, Z.; Wang, Y.; Strbac, G. Mean-field multi-agent reinforcement learning for peer-to-peer multi-energy trading. IEEE Trans. Power Syst. 2022, 38, 4853–4866. [Google Scholar] [CrossRef]
  38. Li, T.; Zhu, K.; Luong, N.C.; Niyato, D.; Wu, Q.; Zhang, Y.; Chen, B. Applications of multi-agent reinforcement learning in future internet: A comprehensive survey. IEEE Commun. Surv. Tutor. 2022, 24, 1240–1279. [Google Scholar] [CrossRef]
  39. Zhao, N.; Pang, X.; Li, Z.; Chen, Y.; Li, F.; Ding, Z.; Alouini, M.S. Joint trajectory and precoding optimization for UAV-assisted NOMA networks. IEEE Trans. Commun. 2019, 67, 3723–3735. [Google Scholar] [CrossRef]
  40. Hu, J.; Wellman, M.P. Nash Q-Learning for General-Sum Stochastic Games. J. Mach. Learn. Res. 2003, 4, 1039–1069. [Google Scholar]
  41. Jaakkola, T.; Jordan, M.I.; Singh, S.P. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Comput. 1994, 6, 1185–1201. [Google Scholar] [CrossRef]
Figure 1. UAV-assisted D2D caching networks.
Figure 2. The MADRL framework for the UAV-assisted D2D caching network.
Figure 3. The structure of the SH-MAMFAC framework.
Figure 4. Training processes of different algorithms.
Figure 5. Cache hit probability comparison under different RU numbers.
Figure 6. Transmission delay comparison under different CU bandwidths.
Figure 7. Optimized UAV trajectory obtained by the proposed algorithm.
Figure 8. Transmission delay comparison under different UAV heights.
Figure 9. Cache hit probability comparison under different UAV cache spaces.
Figure 10. Transmission delay comparison under different maximum transmission power levels.
Table 1. Summary of existing works.
Work | Trajectory Design | Heterogeneous Services | Mixed Actions | Joint Optimization | Scalability
[8]
[10]
[16]
[17]
[18]
[13]
[14]
[15]
[19]
Ours
Table 2. List of main symbols and variables.
Parameter | Description
I | Number of CUs.
J | Number of RUs.
F | Number of files.
Q | UAV flight trajectory Q = {q_1, ..., q_N}.
C_d | Cache placement matrix of the CUs.
c_u | Cache placement vector of the UAV.
p | Transmission power vector of the CUs.
r_{i,j} | Data rate from CU i to RU j.
Pr_{i,j} | Successful transmission probability from CU i to RU j.
r_{u,j} | Data rate from the UAV to RU j.
R_d | Maximum communication range for D2D transmission.
R_u | Maximum communication range for UAV transmission.
Pr_{LoS,u,j} | LoS channel probability.
N_0 | Noise spectral density.
req_{j,f} | The probability of RU j requesting file f.
G_{j,f} | The set of CUs that meet the conditions for establishing D2D links.
H | The total cache hit probability.
T_j | The total transmission delay for RU j.
Table 3. Summary of simulation parameters.
Parameter (System) | Value | Parameter (Training) | Value
δ_t, T [7,8] | 0.2 s, 200 s | σ_N | 0.1
A, U [25] | 0.503, 120 | γ | 0.95
s, ρ [25] | 0.05, 1.225 | M, M_i | 3000
B_m [17] | 20 MHz | B, B_i | 128
d_r, v_r [25] | 0.3, 4.03 | κ | 0.01
P_u | 3.5 W | ϱ_π | 10^{-5}
N_0 [17] | −100 dBm/Hz | ϱ_Q | 10^{-4}
P_0, P_1 [25] | 79.86 W, 88.63 W | L_ac, L_cr | 5, 5
E_{u,max} [23] | 50 kJ | n_{ac,1}, ..., n_{ac,(L_ac−1)} | 256, 64, 32
V [23] | 15 m/s | n_{cr,1}, ..., n_{cr,(L_cr−1)} | 256, 64, 32
Table 4. Computational performance of different algorithms.
Algorithm | Episodic Time (s) | Number of Episodes | Total Time (h)
SH-MAMFAC | 1.38 | 1600 | 0.61
UH-MADDPG | 1.79 | 3000 | 1.49
UC-DRL | 2.62 | 5500 * | 4.00 *
SI-DRL | 0.96 | 1100 | 0.29
* Failed to complete training within 4 h.