1. Introduction
Owing to their low cost and flexibility, unmanned aerial vehicles (UAVs) are frequently employed in various search missions. Depending on the type of target motion, these problems can be categorized into static target search problems and dynamic target search problems. In static target search problems, common scenarios include resource exploration in harsh environments and rescuing individuals trapped in earthquakes. In dynamic target search problems, application scenarios encompass rescuing lost tourists in forests and searching for enemy targets cruising in warfare.
Based on the differences in modeling approaches, the existing methods for UAV search can be primarily categorized into four types: traditional methods, heuristic methods, probability-based methods, and deep reinforcement learning-based methods. Traditional methods can be further divided into exact algorithms and approximation algorithms. Notable representatives of exact algorithms are integer programming algorithms and A* search algorithms [1]. Approximation algorithms include Voronoi diagram methods [2] and visibility graph-based methods [3]. Traditional methods have good interpretability and exhibit excellent performance on simple problems; however, they often lack scalability and are difficult to model and solve when dealing with complex problems. Heuristic algorithms encompass genetic algorithms [4,5], ant colony optimization [6], wolf pack algorithms [7], and particle swarm optimization [8,9]. These algorithms can be applied to search problems of various scales and are often easier to model. Probability-based methods [10,11] often exhibit superior interpretability compared to intelligent algorithms, and in some instances they may also demonstrate better performance. Nevertheless, these methods yield fixed solutions: when the scenario changes even slightly, the solutions they produce cannot automatically adapt to the variation. In contrast, deep reinforcement learning-based algorithms produce an intelligent agent capable of real-time decision-making, and therefore adapt better to dynamic scenarios.
The characteristics of the four categories of methods are presented in Table 1.
In conclusion, deep reinforcement learning methods are emerging as a focal point for researchers due to their ease of modeling, scalability to problems of varying magnitudes, and adaptability to changing scenarios. These attributes make deep reinforcement learning an increasingly attractive approach in the field.
Researchers have proposed numerous methods for guiding search using deep reinforcement learning algorithms. Imanberdiyev et al. [12] solved the problem of autonomous drone path planning in a positional environment using model-based reinforcement learning methods, while Wang et al. [13] proposed a reinforcement learning method for this problem that does not require building a planning model. They investigated drone search strategies in scenarios with limited communication spectrum. Wang Tian et al. [14] studied the search and tracking problem for multiple drones under energy constraints. They explored drone search schemes in scenarios with potential threats. Shen et al. [15] proposed an improved multi-agent reinforcement learning algorithm for the scenario of multiple UAVs searching for multiple static targets. Su et al. [16] investigated a multi-agent reinforcement learning algorithm with constraints on sensing range and communication range. Zhang et al. [17] investigated a multi-UAV search method that combines belief probability graphs and an improved DDPG algorithm.
In these studies, reinforcement learning demonstrated its powerful adaptability and scalability. However, this performance comes at the cost of long training times. To address this issue, researchers have proposed numerous agent–environment parallel interaction frameworks. These frameworks leverage the multi-core capabilities of modern processors, transforming what were originally serial, multi-round interactions between agents and environments into parallel, single-round interactions.
For instance, Ray [18] and RLlib [19] can harness the hardware processing capabilities of distributed systems. Dalton et al. [20] demonstrated that running the Atari environment on GPUs significantly accelerates the agent's learning speed by increasing the number of environments. Subsequently, Makoviychuk et al. [21] proposed Isaac Gym as a GPU-parallel simulation platform, serving as a new benchmark environment for reinforcement learning. Weng et al. introduced Envpool [22], which supports not only the Atari environment but also complex environments such as StarCraft. In various subfields, parallel agent–environment interaction methods have been extensively utilized. For instance, Liu et al. [23] developed a high-performance GPU-parallel underwater robot learning environment, and Freeman et al. [24] proposed a differentiable physics simulation tool utilizing GPU parallel computing.
However, when addressing the long-duration search problem for UAVs, these methods still face two primary challenges. First, regardless of the number of processor cores utilized, agents still require at least one complete round of interaction with the environment. When the application scenario involves searching for unknown dynamic targets, the uncertainty of target movement information can lead to lengthy search times. As a result, each round of simulation consumes considerable time, thereby slowing the learning speed of the agent. Second, a large number of simulated environments consumes substantial GPU memory. Since all exploration data of the agents must be stored prior to updating the agent network, both increasing the number of environments and prolonging the agents' exploration time lead to increased memory usage. Consequently, when addressing the problem of long-duration drone search, memory constraints may prevent the creation of a sufficient number of simulated environments for agent training.
In our previous work [25], we proposed the OC-MAPPO method, which integrates optimal control with deep reinforcement learning (DRL) to address the multi-UAV cooperative search problem. The experimental results demonstrated its enhanced adaptability to dynamic environments. Additionally, we employed the environment-pool method proposed in [22], utilizing parallel environments to accelerate agent learning. During experiments, we observed that increasing the number of environments from 1 to 10 significantly improved the agent's learning speed, as the interaction time before each update shrank from multiple rounds to a single round. Yet further increasing the number of environments to several hundred or even several thousand did not improve the learning speed, because the data collected before each policy update originate from the same distribution: adding environments merely yields more homogeneous data. How to overcome this limitation and further accelerate DRL training by scaling up the number of parallel environments therefore becomes the question of interest.
In this paper, through analysis and experimentation, we discover that in the context of UAV cooperative search tasks, it is possible to designate different time segments for different environments during the parallel process. This approach allows agents to engage in minimal forward interactions with each environment, further expediting the learning speed of the agents. Our findings suggest a novel approach to optimizing the efficiency of reinforcement learning in multi-agent UAV scenarios, potentially overcoming existing limitations in the field.
The remainder of this article is structured as follows: Section 2 delineates the problem scenario under investigation; Section 3 elucidates and analyzes extant methodologies; Section 4 expounds upon the proposed approach; Section 5 empirically validates the efficacy of the proposed method through experimental evaluation.
3. Preliminary and Analysis
Existing reinforcement learning methods typically involve an agent interacting with the environment over multiple episodes, denoted as C, and improving its behavior based on the experiences gathered. Without any parallelization, if a single episode takes time T, the time the agent spends interacting with the environment before each policy update is C · T. If multiple environments are run simultaneously during training, as in the OC-MAPPO method, this time can be reduced to T. However, in OC-MAPPO the number of environments launched far exceeds C, yet the time expenditure cannot be reduced below T. Based on these considerations, this section reanalyzes the principles of reinforcement learning and proposes a new architecture to improve the training process.
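As a worked illustration with hypothetical numbers: if C = 10 episodes of experience are required per policy update and a single episode lasts T = 200 environment steps, then
t_{\text{serial}} = C \cdot T = 10 \times 200 = 2000 \ \text{steps}, \qquad t_{\text{parallel}} = T = 200 \ \text{steps},
so launching 10 or more parallel environments removes the factor C, but no number of additional environments can push the per-update interaction time below T.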
When applying reinforcement learning to solve this problem, the cumulative return and single-step reward are defined as follows:
Let s_t denote the state of the UAVs at time t. The state s_t encompasses each UAV's position, its orientation, and the probabilistic map of target information. Let V*(s_t) represent the maximum expected reward attainable when the UAVs are in state s_t. The mathematical formulation of V*(s_t) is as follows:
In essence, the original problem can be reformulated as follows:
If we know the form of the function V*, then we can solve the original problem by simply solving the reformulated problem above, in which the agent only needs to interact with the environment for t steps. This formula delineates the interactive process between the agent and the environment within the time interval [0, t]. The value of V*(s_t) can be obtained by solving the following problem:
This formula corresponds to the interaction process between the agent and the environment during the time interval [t, T]. It can be observed that this problem shares the same structure as the original problem, thus allowing the application of identical solution methods. In this case, one could directly employ reinforcement learning techniques to solve it, or alternatively partition it into two further intervals, from t to t' and from t' to T. The optimal policy for the interval from t to t' can then be determined by estimating V*(s_{t'}), ultimately yielding the value of V*(s_t).
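To make the decomposition concrete, the two sub-problems can be sketched in the standard Bellman-style form (a generic sketch using the notation above, with undiscounted rewards r_k; Equations (9) and (10) are assumed to follow this pattern):
V^*(s_0) = \max_{\pi} \mathbb{E}\left[ \sum_{k=0}^{t-1} r_k + V^*(s_t) \right], \qquad V^*(s_t) = \max_{\pi} \mathbb{E}\left[ \sum_{k=t}^{T-1} r_k \right].
The first expression covers the interval [0, t] and treats V*(s_t) as a terminal value; the second covers [t, T] and has the same structure as the original full-horizon problem.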
A question arises: is it feasible to execute the processes of solving (9) and (10) concurrently? Intuitively, it would seem that the second process must invariably follow the first, since its starting state s_t can only be reached by first executing the first process from the initial state, precluding parallel execution. However, if we initiate both processes simultaneously, with the first process remaining idle during a preparation phase while the second process lets the agent interact with the environment for t steps, we can then complete one simulation corresponding to (9) and one corresponding to (10) simultaneously in each subsequent interaction (assuming T − t ≤ t). This methodology is delineated in Algorithm 1:
Algorithm 1 Asynchronous Exploration Algorithm
Warm-up stage:
    For process 1 (segment [0, t]): do nothing
    For process 2 (segment [t, T]): for each step in 1…t, interact with the environment
Train stage:
    For each process i in {1, 2}:
        For each step in its assigned segment: interact with the environment and store the resulting experience
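To make Algorithm 1 concrete, the following minimal Python sketch illustrates the idea for two processes. The env.reset()/env.step() interface and the policy() callable are hypothetical placeholders, and the two segments are written sequentially here for readability; a real implementation would run them in parallel workers.

def asynchronous_exploration(env1, env2, policy, t, T):
    # Sketch of Algorithm 1: env1 covers the segment [0, t], env2 covers [t, T].
    # Warm-up stage: advance env2 to time t while env1 stays idle.
    s2 = env2.reset()
    for _ in range(t):
        s2, _, _, _ = env2.step(policy(s2))

    # Train stage (one data-collection round): in the real method the two
    # loops below run concurrently in separate workers.
    buffer = []
    s1 = env1.reset()
    for _ in range(t):                      # segment [0, t] in env1
        a = policy(s1)
        s1_next, r, done, _ = env1.step(a)
        buffer.append((s1, a, r, s1_next))
        s1 = s1_next
    for _ in range(T - t):                  # segment [t, T] in env2
        a = policy(s2)
        s2_next, r, done, _ = env2.step(a)
        buffer.append((s2, a, r, s2_next))
        s2 = s2_next
    return buffer

Because the two loops execute simultaneously in separate workers in the actual method, the wall-clock cost of one data-collection round is roughly max(t, T − t) environment steps instead of T.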
It is important to note that this approach differs from using a single process to perform the full T-step simulation. Because the agent's policy exhibits some randomness in each interaction, and the environment itself may possess inherent stochasticity, the terminal state of the first process and the initial state of the second process may not be identical. What impact, then, does this difference have on the learning of the agent?
To illustrate this issue, we use the Proximal Policy Optimization (PPO) [26] algorithm, a prominent representative of reinforcement learning algorithms. The solving process involves two components: an actor and a critic. The actor is responsible for the agent's behavioral policy π_θ, accepting the drone's state as input and outputting the drone's action, i.e., a_t ∼ π_θ(·|s_t). The critic is responsible for fitting V*(s_t). If the PPO algorithm is employed to train the actor and critic, the parameter update formulas for the actor and critic are as follows:
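For reference, in standard PPO these updates take roughly the following form (a generic sketch with clipping parameter ε; Equations (11) and (12) are assumed to follow this pattern):
\theta \leftarrow \theta + \alpha \nabla_{\theta}\, \mathbb{E}_t\!\left[ \min\!\left( \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, 1-\epsilon, 1+\epsilon \right) A_t \right) \right],
v \leftarrow v - \alpha \nabla_{v}\, \mathbb{E}_t\!\left[ \left( V_v(s_t) - \hat{R}_t \right)^2 \right],
where \hat{R}_t denotes the empirical return (reward-to-go) target for state s_t.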
In this context, θ represents the parameters of the actor network, and α denotes the learning rate.
For the critic update, v denotes the parameters of the critic network, and γ represents the discount factor, which is a commonly used parameter in reinforcement learning. In this particular problem, to ensure correspondence with the original formulation, γ is set to 1. Let us first consider the learning process of the critic. The fitting target of the critic is as follows:
Although the ultimate goal of the critic is to fit V*, it first attempts to fit the expected reward under the current policy. The algorithm then improves the actor's policy by letting the actor choose actions that generate higher expected rewards than the current policy. When the actor's policy has been improved to the optimal policy, the critic network will be able to fit the true V* value. From Equation (13), it can be seen that the update of the critic depends on the reward r obtained from the interaction between the agent and the environment. Whether in process (9) or process (10), the agent obtains the true r from its interaction with the environment, thereby improving the accuracy of the critic's fit to the expected return.
Nonetheless, a challenge persists for process (9). In the original, uninterrupted process, the following equation holds:
where P denotes the environmental model. However, for Algorithm 1, at time t the terminal state of the first process and the starting state of the second process generally differ, so this relation no longer holds. This discrepancy means that process (9) cannot generate the labels for the critic, because for each state visited by process (9), the corresponding critic label is computed as follows:
Since process (9) is interrupted at time t, the reward data for the interval [t, T] cannot be obtained from it.
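Concretely, with γ = 1 the Monte Carlo label for a state visited at step τ is presumably the reward-to-go over the remainder of the episode,
y_\tau = \sum_{k=\tau}^{T-1} r_k,
so a trajectory that is cut off at time t cannot supply the terms r_t, …, r_{T−1} needed to label its own states at steps τ < t.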
Fortunately, this issue is alleviated when the agent interacts simultaneously with multiple environments. Consider the set S_t, which represents the set of all states reachable after t time steps from the initial state. Both the terminal state of the first process and the starting state of the second process belong to S_t. Although they lie in the same set, S_t often contains a large number of elements, so the critic cannot infer the value of one of these states solely from data about the other. However, if we have a large amount of data, say m samples, we can obtain their labels through the following equation.
Then, we can use these m labeled data points to fit V*(s_t) for s_t ∈ S_t with the critic neural network. The fitted critic can then be used to obtain the value of V*(s_t) at the terminal state of the first process.
Regarding the actor, as per Equation (11), the update of the actor is related to π_θ and the advantage A(s_t, a_t), where θ denotes the parameters of the actor, and A(s_t, a_t) represents the advantage of executing action a_t in state s_t, which can be calculated by the following equation:
From the above equation, it is evident that the value of A(s_t, a_t) is related to the output of the critic and the actual reward r obtained by the agent. When the critic can be updated normally, the actor can also be updated accordingly.
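A common one-step form of this advantage (a generic sketch of what the equation above presumably computes) is
A(s_t, a_t) = r_t + \gamma V_v(s_{t+1}) - V_v(s_t),
which depends only on the critic's outputs and the observed reward r_t.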
Through the aforementioned analysis, we have demonstrated that when the number of environments is large, the interaction between the agent and the environment can be divided into two parts, the [0, t] segment and the [t, T] segment, which can run concurrently. After this partition, each segment exhibits the same structure as the original problem, which suggests that if the number of environments is sufficiently large, the process can be subdivided further. Consequently, we propose the temporally asynchronous grouped environment reinforcement learning (TAGRL) method. This approach divides the environments into several groups; after a preparation phase lasting T, each group executes a different time segment of an interaction episode.
Let E denote the total number of environments and G the number of environment groups. Through the aforementioned design, the algorithm allows the agent to interact with each environment only T/G times before each update instead of T times. Furthermore, compared to existing parallel methods (such as OC-MAPPO), the TAGRL method has lower memory consumption. The key memory consumer for both algorithms is the buffer, which stores the experience data generated during training. In the OC-MAPPO method, the data storage requirement is as follows:
In contrast, the memory requirement for TAGRL is as follows:
where N represents the number of agents. In the above calculations, we disregard the actions and rewards, as their memory usage is significantly less than that of the observations and states (the latter being multidimensional tensors while the former are vectors). It is evident that, with the same number of environments, the TAGRL method requires less memory than the OC-MAPPO method.
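As a hedged sketch of the proportionality behind these two requirements (d_obs and d_state are assumed symbols for the per-step sizes of an agent's observation and of the global state; the exact equations may include additional terms):
M_{\text{OC-MAPPO}} \propto E \cdot T \cdot N \cdot (d_{\text{obs}} + d_{\text{state}}), \qquad M_{\text{TAGRL}} \propto E \cdot \frac{T}{G} \cdot N \cdot (d_{\text{obs}} + d_{\text{state}}),
so for the same number of environments the TAGRL buffer is smaller by roughly a factor of G.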
4. Method
Based on the preceding analysis, we propose an asynchronous parallel reinforcement learning framework.
Suppose we have E environments, which we divide into G groups (in Figure 2, we use G = 5 as an example to illustrate the method). During the preparation phase, the first four groups of environments each interact with the agent for a group-dependent number of steps, while the fifth group remains inactive. The preparation phase consumes time equivalent to one complete round of interactions. After the preparation phase, before each update of the agent's policy, each environment group interacts with the agent T/G times. During the warm-up phase, for the i-th environment, the following formula determines the number of steps the agent needs to interact with that environment.
In this context, G represents the number of environment groups, and E represents the total number of environments. For environments that have not terminated, the advantage value obtained by the agent is calculated as follows:
where V(s_{t+1}) represents the expected reward obtained by the agent after executing an action, while V(s_t) denotes the expected reward for the agent prior to action execution; δ_t signifies the advantage value of a single-step action taken by the agent, and A_t represents the cumulative advantage value of the agent.
For terminated environments, the advantage value obtained by the agent is calculated as follows:
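A generic sketch of the standard GAE-style forms that Equations (22) and (23) presumably take (λ denotes the GAE parameter, and γ = 1 in this problem):
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad A_t = \delta_t + \gamma \lambda A_{t+1} \quad \text{(environment not terminated, value bootstrapped)},
\delta_T = r_T - V(s_T), \qquad A_T = \delta_T \quad \text{(environment terminated, no bootstrapping)}.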
After determining the A-values required for updating the agent, these can be substituted into the original formula to continue the agent’s training process.
Figure 3 below illustrates the advantages of this method and its distinctions from the original approach. The asterisk (*) in the figure denotes multiplication.
In the most common practice, an agent sequentially interacts with a single environment for n rounds, so the time spent on agent–environment interactions is n · T. If parallel environments are employed, utilizing n or more environments to interact with the agent simultaneously and collect data, the time spent on agent–environment interactions is reduced to T. However, with this approach, whether n environments or many times that number are used, the agent–environment interaction time remains T.
In the TAGRL method, environments are divided into different groups. For instance, if the E environments are divided into 5 groups, each group contains E/5 environments. After the warm-up phase, each environment only needs to advance T/5 steps before every policy update, thereby further reducing the interaction time between the agent and the environments.
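As an illustration of the bookkeeping involved, the following Python sketch computes a plausible per-environment schedule; the contiguous block-to-group assignment and the segment offsets are assumptions made for illustration, not the exact assignment given by Equation (21).

def tagrl_schedule(E, G, T):
    # For each environment, return its group index, warm-up steps, and the
    # number of steps it advances per policy update under the TAGRL scheme.
    per_group = E // G              # environments per group (assumes G divides E)
    steps_per_update = T // G       # each group covers one segment of length T/G
    schedule = []
    for e in range(E):
        g = e // per_group          # contiguous block assignment (an assumption)
        schedule.append({
            "env": e,
            "group": g,
            "warmup_steps": g * steps_per_update,   # advance to the start of segment g
            "steps_per_update": steps_per_update,
        })
    return schedule

# Illustration: 500 environments, 5 groups, episode length T = 200.
# Each group then holds 100 environments; their warm-up offsets are
# 0, 40, 80, 120, and 160 steps, and after warm-up every environment
# advances only T/G = 40 steps per policy update.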
This approach significantly enhances the efficiency of data collection and potentially accelerates the learning process. By strategically grouping environments and implementing a phased interaction strategy, TAGRL optimizes the trade-off between the breadth of environmental exploration and the depth of learning within each environment.
The following outlines the process of the TAGRL search algorithm (Algorithm 2). The symbols utilized in this process are explained in Table 2.
Algorithm 2 TAGRL Search Algorithm
Initialize the parameters θ of the actor network using orthogonal initialization
Initialize the parameters v of the critic network using orthogonal initialization
Initialize the total number of iterations I
Let T denote the time limit for drone search operations
Let E denote the number of environments
Let g denote the number of environment groups
Let n denote the maximum number of simulation turns for a drone
Let r_u denote UAV u's reconnaissance coverage
Warm-up stage:
    for each environment e in 1…E:
        get the state from environment e
        calculate the number of warm-up steps by Equation (21)
        for each t in 1…(warm-up steps):
            execute average filtering on the target probability map by Equation (1)
            for each UAV u: update its position and heading according to the action sampled from π_θ by Equation (3); update the target probability map by Equation (2)
Train stage:
    for each i in 1…n:
        get the states and observations from the environments
        set buffer b = []
        for each t in 1…T/g:
            for each environment e that has not terminated:
                execute average filtering on the target probability map by Equation (1)
                for each UAV u: sample an action from π_θ, update the UAV state by Equation (3), accumulate the reward, and update the target probability map by Equation (2)
                append the collected states, observations, actions, and rewards to b
        compute the advantage estimates via Equations (22) and (23) on b
        compute the reward-to-go R on b and normalize it
        compute the loss of the actor by b and Equation (11)
        compute the loss of V by Equation (12)
        update θ and V
5. Experiment
The experimental setup for this study consisted of a single NVIDIA GeForce RTX 4090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Core i9-14900K processor (Intel Corporation, Santa Clara, CA, USA). The software environment utilized Python 3.11.9. Neural networks were constructed using PyTorch 2.3.1, while the unmanned aerial vehicle (UAV) search environment was implemented using Numba 0.60.0.
To validate the effectiveness of the proposed method, we conducted comparative experiments against the conventional approach. In these experiments, both our method and the conventional approach utilized 256 environments to ensure equal computational resources. Both methods employed MAPPO [27] to solve the established optimal control model (OC-MAPPO) [25]. The primary difference was that our method (TAGRL) incorporated environment grouping and asynchronous temporal interactions, while all other settings and the network architecture remained consistent. To ensure a fair comparison, when evaluating the OC-MAPPO method, a discrete action design was implemented, wherein the UAVs' movement patterns were limited to three options at each step: forward, left, or right. In contrast, for experiments demonstrating UAV trajectories, we employed the aforementioned continuous action model, where the UAVs' movement at each time point was determined by acceleration and angular acceleration. Table 3 presents the key parameter settings.
To mitigate the impact of random factors, each experiment was repeated 30 times, and error bands were plotted. The experimental results are illustrated in Figure 4.
In Figure 4, the orange error band represents the performance of the proposed method, while the blue curve indicates the performance when only the parallel environments are used. It can be observed that agents trained with the TAGRL method learn faster and achieve higher average rewards. Furthermore, the TAGRL method and the OC-MAPPO method utilize the same number of computing cores, but the TAGRL method requires significantly less memory. This is because TAGRL stores only one-fifth of the interaction data per environment, whereas the OC-MAPPO method needs to store all the interaction data.
Subsequently, we conducted a comparative analysis of our method’s performance under varying numbers of parallel environments.
From Figure 5, it can be observed that when the number of parallel environments is relatively low, the learning curve of the agent exhibits greater fluctuations and achieves lower reward values. This phenomenon may be attributed to the fact that, during the learning process in (9), the agent requires a stable V function as a foundation. When the number of environments is limited, the agent obtains less data from each round of interaction with the environment, making it challenging to learn an accurate and stable V function. Conversely, when the agent interacts with a large number of environments, the V function learned by the agent more closely approximates the true expected reward, and the policy learned through the (9) process more accurately reflects the correct policy.
To validate the comparative performance of our method under different search durations, we conducted experiments for four scenarios with different values of the search duration T. Each method was executed for 300 s, with the TAGRL method consistently employing five environment groups. The optimal reward generated during the training process was selected as the representative reward for each method. The experimental results are shown in Figure 6.
The OC-MAPPO method lacks values for some of the larger settings due to memory overflow. As observed in the figure, for the shortest search duration, the OC-MAPPO method performs comparably to TAGRL when fewer environments are used. However, as the number of environments increases, the OC-MAPPO method demonstrates certain advantages; with further increases in the number of environments, TAGRL again exhibits performance similar to the OC-MAPPO method. For the second search duration, TAGRL shows slightly superior performance to OC-MAPPO, and at one setting significantly outperforms it, although this may be attributed to the stochasticity inherent in reinforcement learning. For the third search duration, TAGRL demonstrates a more pronounced advantage, likely due to its shorter simulation time per round, as the OC-MAPPO method can complete fewer rounds in the same time frame. For the longest search duration, both the memory and time advantages of TAGRL become evident. The increase in simulation environments does not significantly constrain TAGRL in terms of memory usage, whereas the OC-MAPPO method experiences memory overflow beyond a certain number of environments. Due to the further reduction in completed simulation rounds, the performance of the OC-MAPPO method also becomes more unstable. Overall, when the reconnaissance time of the UAVs is longer, TAGRL exhibits superior performance compared to the OC-MAPPO method.
To further validate the robustness of the algorithm, we conducted a comparative analysis of its performance under varying numbers of UAVs. We selected scenarios with 2, 3, and 5 UAVs for our comparative experiments. The results of these experiments are illustrated in Figure 7.
To validate the search strategy learned by the agent, we visualized the agent's search trajectory and the changes in the target probability map during the search process. The drone's search trajectory is composed of 200 discrete segments. Figure 8 below illustrates the search trajectories learned by the agent.
To better comprehend the search trajectory of UAVs, we segmented the UAVs’ route for display and selected probability maps of the target at various time points for presentation.
Figure 9 illustrates that initially both UAVs move towards the target point. Subsequently, the UAV closer to the target loss point begins to rotate approximately around a point to the right of and behind the loss point, while the other UAV circles around the left front. Following this, the UAVs continue to rotate on either side of the target loss point. During this period, Figure 10 indicates a significant decrease in the likelihood of the target appearing at or near the loss point. The red arrows in the figure indicate the forward direction of the UAVs. However, in the later stage of the search, despite higher target probabilities in the upper and right regions of the TPM, the UAVs fail to investigate these areas promptly. This observation highlights potential areas for improvement in the current method; increasing the training duration and expanding the network parameters could potentially enhance the UAVs' performance.
Despite the ability of agents to autonomously learn effective search strategies in dynamic long-term search tasks, the current algorithm still has potential areas for improvement in application. First, there exists an approximation error between the discrete TPM and the continuous world. Increasing the density of the grid in the TPM can reduce this error but also increases the algorithm's memory consumption; a potential solution may be to focus on regions of interest within the TPM while compressing other areas. Second, the current research problem does not involve adversarial strategies for the target. The design of drone search strategies in scenarios where the target intentionally employs evasive tactics also deserves further consideration. Third, the current study assumes that UAVs fly at a fixed altitude, resulting in a constant scanning radius. However, in practical applications, UAVs can adjust their altitude to change their reconnaissance radius and precision. Removing this limitation may facilitate the design of more flexible search strategies.