1. Introduction
UAVs are crucial tools in various fields because of their easy deployment and flexible scheduling. They are utilized in emergency search and rescue operations [
1], agricultural irrigation [
2], and urban infrastructure development [
3]. Researchers have investigated the efficiency of drones in various scenarios. Drones can be beneficial in human rescue missions because of their speed and ability to navigate challenging terrains [
1], and they have the capability to assist in the exploration of unfamiliar and intricate situations [
4]. Moreover, artificial intelligence technology enables drones to carry out duties in place of humans. Thanks to object detection algorithms, drones possess target identification capabilities comparable to those of humans [
5]. With machine learning algorithms, drones can distill rules from numerous activities and apply them to novel tasks. These innovations have expanded the potential applications of UAVs and deepened their usage.
The use of multiple UAVs exhibits increased search efficiency and fault tolerance compared to a single UAV, as they can integrate each other’s information [
6] and search different directions simultaneously. The efficient cooperation of several UAVs in searching for targets is a significant research subject that has garnered considerable attention from scholars. Through strategic organization, several UAVs can cover distinct locations at the same time, significantly decreasing the time needed for the search. While searching the target area, the UAVs devise their routes autonomously to jointly optimize the overall reconnaissance coverage.
The issue of multi-UAV cooperative search can be categorized into static target search and dynamic target search based on the motion characteristics of reconnaissance targets. In a typical static target search situation, the objective is to explore a stationary target whose initial position is uncertain. The primary aim is to optimize the entire coverage area of the UAV. Typical dynamic target search scenarios include human rescue operations and criminal pursuits, where partial information about the target’s last known location, such as the vicinity of their disappearance, is available.
Dynamic target search is more complex than static target search due to the uncertain movement strategy of the target. When the UAV moves, it is necessary not only to consider the interactive information between the drones but also to consider the various possible movement directions of the target. In areas previously scanned by UAVs, it remains possible for targets to emerge. Furthermore, the asymmetry in the probability of the target appearing in the environment due to the initial position knowledge requires the development of specific algorithms to utilize this prior information effectively. In addition, because the target’s movement direction is unpredictable, it may take a long time to locate the target, resulting in an increased upper limit on the search duration. Because of the unique attributes of the dynamic target search problem, algorithms created for static targets are often not directly applicable. Therefore, it is essential to develop new algorithms specifically for dynamic target search problems. Traditional offline planning methods for dynamic target search, such as evolutionary algorithms [
7,
8], PSO (Particle swarm optimization) algorithms [
9,
10], and artificial potential field (APF) algorithms [
11], can leverage heuristic information to generate an excellent search trajectory for a drone when the target is lost. However, such a trajectory is only applicable to the given search scenario: whenever the location at which the target was lost changes, the algorithm must spend additional time computing a new search plan.
In contrast, reinforcement learning methods can train an agent to perform such tasks, and a well-trained agent can directly execute the search task when the target is lost without the need for further iterative optimization. Moreover, since the agent can receive real-time observations from the environment and then execute actions, even if unexpected situations occur during the execution of the task, such as a deviation in the initial position of the target or a deviation in the previously received information about the positions of other drones, the agent can continue to perform the task based on the latest information. Traditional methods that compute fixed trajectories cannot do this. Therefore, training an agent to perform search tasks using deep reinforcement learning algorithms may better align with practical requirements.
However, when training agents to perform a coordinated search for dynamic targets using reinforcement learning, there are still some remaining issues.
Firstly, owing to the uncertainty of target movement, the results of UAVs’ finding the target vary even when adopting the same strategy, which can make the rewards obtained by the drones more random. According to the literature [
12], intelligent agents face greater challenges in learning within stochastic environments compared to deterministic settings.
Additionally, before the target is discovered, the only observations available to the drones are each other's positions, which makes the rewards they obtain sparse. Sparser rewards also make the learning process more challenging for intelligent agents [
13].
Finally, to adapt to diverse scenarios under different settings, the intelligent agents require more training data during training than are needed for a single specific scenario, which necessitates longer interaction times with the environment.
These three factors may increase the difficulty of learning search tasks and slow down the agents' learning. Moreover, from a technical perspective, existing deep reinforcement learning methods can leverage deep learning libraries to run the agent's learning procedures quickly on GPUs. However, the environment the agent interacts with typically runs on the CPU. Compared to GPUs, CPUs have far fewer cores, which makes the interaction between the agent and the environment a bottleneck in the algorithm's execution. This is also a significant reason for the slow execution of reinforcement learning programs [
14].
To facilitate the agent in learning search tasks, this paper constructs an optimal control model that provides stable and continuous rewards to assist the agents in learning, thereby enhancing their learning efficiency. Furthermore, by developing models that run on GPUs, algorithms can benefit from the massive number of cores available on GPUs. Unlike the process before parallelization, where intelligent agents needed to interact with the environment over multiple rounds, parallelized agents only need to interact with the environments for a single round. The interaction data can be directly transferred for learning through the DLPack protocol without additional data transfers from the CPU to the GPU, further improving the training speed of the agents. Based on the research above, this paper compares several state-of-the-art reinforcement learning methods and provides experimental validation of the work conducted.
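As a minimal, hedged illustration of the zero-copy hand-off described above, the sketch below moves a batch of GPU-resident observations from a CuPy-backed environment buffer into a PyTorch learner through the DLPack protocol. The use of CuPy, the observation shape, and the batch of 768 environments are illustrative assumptions rather than the paper's actual implementation.

```python
import cupy as cp   # stands in for the GPU-resident environment state
import torch

# Hypothetical batch of observations produced by 768 environments running on the GPU.
obs_gpu = cp.random.rand(768, 4, 64, 64, dtype=cp.float32)

# Zero-copy hand-off via the DLPack protocol: the resulting torch tensor shares
# memory with the CuPy array, so no CPU round-trip or extra device copy is needed.
obs_torch = torch.from_dlpack(obs_gpu)

assert obs_torch.is_cuda and obs_torch.shape == (768, 4, 64, 64)
```

Because the learner reads the interaction data directly from GPU memory, the CPU-bound environment stepping that normally dominates reinforcement learning wall-clock time is avoided.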
The rest of this work is organized as follows:
Section 2 outlines the current research status.
Section 3 introduces the problem background, and
Section 4 explains the optimal control model established.
Section 5 introduces a solution method for the model based on MARL (multi-agent reinforcement learning).
Section 6 presents the experimental validation of the work and discusses the results.
Section 7 offers a conclusion of this article.
2. Related Works
Significant research progress has been achieved in the field of multi-UAV cooperative search. Cao and colleagues [
15] studied a hierarchical task planning method for UAVs with an improved particle swarm algorithm. This approach is appropriate for cases when the number of search tasks changes dynamically. Xu and colleagues [
16] proposed a multi-task allocation method for UAVs via a multi-chromosome evolutionary algorithm that considers priority constraints between tasks during allocation. Zuo et al. [
17] investigated the task assignment issue for many UAVs with constrained resources.
These methods concentrate exclusively on the search problem at a higher level. However, detailed path planning or trajectory optimization is also crucial for UAVs to carry out a cooperative search. Y. Volkan Pehlivanoglu [
18] combined genetic algorithms and Voronoi diagrams in the study cited to propose a path-planning method for numerous UAVs in a 2D environment. Wenjian He and his colleagues [
19] employed an advanced particle swarm algorithm to develop a novel path-planning technique for many UAVs in a three-dimensional environment. Liu et al. investigated methods for path planning based on a set of points of interest [
20]. In addition, a reinforcement learning method was also applied to assist intelligent agents in searching static targets [
21] and mobile targets [
22]. Jiaming Fa and colleagues [
23] presented a path planning technique that utilizes the bidirectional APF-RRT* algorithm. Kong and colleagues [
24] examined task planning and path planning simultaneously.
However, in these studies, the location of the target is usually known. In certain search scenarios, such as rescuing lost travelers, the target's location is often uncertain and must be determined quickly by the UAVs. In such scenarios, Peng Yao [
3] and others used an improved model predictive control method to solve the problem of how multi-UAVs can efficiently search for unknown targets, taking into account the communication constraints between UAVs. Hassan Saadaoui [
25] and others used a local particle swarm algorithm-based information-sharing mechanism for UAVs to help them quickly find targets in unknown environments. Liang et al. [
26] investigated how UAVs search for and attack targets in unknown environments. Samine Karimi and Fariborz Saghafi [
27] investigated a collaborative search method based on shared maps that is suitable for cost-effective systems. Xiangyu Fan et al. [
28] proposed a collaborative search method that combines inclusion probability and evolutionary algorithms for UAV applications. Furthermore, when multiple uncrewed aerial vehicles (UAVs) operate collaboratively, they may encounter unexpected situations, such as the loss of some UAVs. To address this issue and enhance the robustness of UAV systems, Abhishek Phadke and F. Antonio Medrano [
29] investigated methods for recovering the functionality of compromised UAVs.
Traditional methods for solving target search problems in unknown environments have poor scalability and run too slowly in large map areas [
30]. For large-scale search tasks, Chen et al. [
31] proposed a hierarchical task planning method that decomposes the overall search problem into smaller, more manageable subtasks. Hou Yukai and others [
30] proposed a target search method based on the multi-agent deep deterministic policy gradient (MADDPG) mentioned by [
32]. Because of the algorithm’s centralized training and distributed execution mechanism, the computation efficiency does not significantly decrease even with a large number of UAVs.
When employing reinforcement learning for unknown static target search, intelligent agents can directly interact with the environment and use the degree of exploration as a reward function to facilitate the learning process. However, when it comes to searching for a lost dynamic target, several new challenges arise. Firstly, because the motion strategy of the target is unknown, it is not feasible to directly establish a corresponding simulation environment. Secondly, the prior information regarding the target's lost location results in a non-uniform distribution of the target's appearance probability, rendering the degree of exploration no longer viable as a reward function.
To solve the issues mentioned above, this paper first models the dynamic target search problem as an optimal control problem, then regards the established optimal control problem as a simulation environment, and finally uses multi-agent proximal policy optimization (MAPPO) [
33] to solve the modeled problem, thus making full use of the powerful fitting ability of neural networks and the efficient optimization ability of reinforcement learning algorithms.
4. Optimal Control Model Construction
The search mission of the drone is to find the target in the shortest possible time, and once the target is found, the search mission ends. Therefore, except for the final moment, it can be assumed that the drone does not find the target during the mission. The optimization objective is to minimize the probability that the UAVs fail to detect the target. Let $U$ denote the set of uncrewed aerial vehicles. Denote by $p_u$ the probability that UAV $u$ finds the target during each reconnaissance, and by $D$ the area covered by the UAV during each reconnaissance; then the probability that UAV $u$ fails to detect the target is $1 - p_u$, and the probability that all UAVs fail to detect the target is calculated as follows:

$$\prod_{u \in U} \left(1 - p_u\right)$$
Suppose that the UAVs scout the target $N$ times from start to end, and the maximum reconnaissance time is $T$. Then the probability that the reconnaissance drones do not find the target in any of the $N$ reconnaissances is $\prod_{k=1}^{N} \prod_{u \in U} \left(1 - p_u(k)\right)$. Taking the logarithm of this formula turns products into sums, and because the logarithm is monotonic, it does not change the extreme point of the original formula. Taking the logarithm gives the following formula:

$$J = \sum_{k=1}^{N} \sum_{u \in U} \ln\left(1 - p_u(k)\right),$$

where $J$ represents the optimization objective. If the reconnaissance frequency of the UAV is high and the reconnaissance probability changes continuously with time, the performance index can also be written in the following continuous form:

$$J = \int_{0}^{T} \sum_{u \in U} \ln\left(1 - p_u(t)\right) dt$$
Among them, $p_u(t)$ is calculated as the integral of the UAV reconnaissance probability over the reconnaissance area $D_u(t)$. $D_u(t)$ is determined by the sensor model and the drone's position. Let $b(x, y, t)$ denote the probability of the target appearing at coordinates $(x, y)$ at time $t$, and let $g(x, y)$ represent the reconnaissance accuracy of the UAV at coordinates $(x, y)$. Then, the calculation formula for $p_u(t)$ is as follows:

$$p_u(t) = \int_{D_u(t)} g(x, y)\, b(x, y, t)\, dx\, dy$$
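On a discretized target probability map, this integral reduces to a sum over the grid cells covered by the UAV's sensor footprint. The toy sketch below (with an assumed rectangular footprint and an assumed uniform sensor accuracy, neither of which is prescribed by the model) illustrates the computation.

```python
import numpy as np

def detection_probability(b, g, footprint):
    """Discrete approximation of p_u = integral over D_u of g(x, y) * b(x, y).

    b         : 2-D array, target probability map (total mass <= 1)
    g         : 2-D array, per-cell sensor detection accuracy, same shape as b
    footprint : boolean mask of the cells currently covered by UAV u
    """
    return float(np.sum(g[footprint] * b[footprint]))

# Toy example: 50x50 map, prior mass concentrated in a small patch,
# a uniform 90% sensor, and a 5x5 footprint centred on that patch.
b = np.zeros((50, 50))
b[18:23, 18:23] = 1.0 / 25.0          # assumed prior: uniform over the patch
g = np.full((50, 50), 0.9)            # assumed constant sensor accuracy
footprint = np.zeros((50, 50), dtype=bool)
footprint[18:23, 18:23] = True        # UAV hovering over the likely area

p_u = detection_probability(b, g, footprint)
print(p_u)                            # 0.9 here, since the footprint covers all prior mass
```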
Here, optimal control theory is used to design a control scheme that regulates the UAVs' reconnaissance paths so that the continuous-time performance index above is minimized, thereby optimizing the search efficiency and maximizing the probability of finding the target. The state variables are the positions of the UAVs and the target probability map. The input of the system is the action matrix $\mathbf{A}$ taken by the UAVs, where each row $a$ of $\mathbf{A}$ represents the sequence of actions taken by one drone. The transfer model of the system applies two operators to the target probability map at each step. The first operator, defined in Equation (2), represents the Bayesian update of the target probability map based on the reconnaissance information obtained by the UAV at its new position. The second operator, defined in Equation (1), captures the effect on the target probability map of the expansion of the target's activity range as time increases. The UAVs must also obey several restrictions during the reconnaissance process: a minimum separation distance $d_{\min}$ must be maintained between UAVs at all times to prevent collisions or repeated observation, and the turning angle of each drone is limited, so a drone can only turn within a certain range of its current heading; the maximum steering angle is denoted by $\theta_{\max}$. Finally, the optimal control model of the problem is formulated as follows:

$$\min_{\mathbf{A}} \; J = \int_{0}^{T} \sum_{u \in U} \ln\left(1 - p_u(t)\right) dt$$

subject to the state transition defined by Equations (1) and (2), the separation constraint $\lVert \mathbf{x}_u(t) - \mathbf{x}_v(t) \rVert \ge d_{\min}$ for all $u \ne v$, and the turning constraint $\lvert \Delta\theta_u(t) \rvert \le \theta_{\max}$ for all $u \in U$.
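For intuition only, the sketch below implements one plausible transition step for the target probability map: an average-filter diffusion standing in for Equation (1) (average filtering is the operation Algorithm 1 names for the target-motion step) and a renormalized "no detection" Bayesian attenuation standing in for Equation (2). Since Equations (1) and (2) are defined outside this excerpt, both operators here are assumptions, as is the reuse of the performance index as the single-step reward.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def step(b, g, footprint):
    """One assumed state-transition step for the target probability map b.

    Diffusion (target activity range expanding over time) is modelled with an
    average filter; the Bayesian update assumes the target was NOT detected
    inside the current footprint. Both are stand-ins for Equations (1) and (2).
    """
    # Target motion model: spread probability mass to neighbouring cells.
    b = uniform_filter(b, size=3, mode='constant')

    # Single-step detection probability before the Bayesian update.
    p_u = float(np.sum(g[footprint] * b[footprint]))

    # Bayesian update for a "no detection" observation:
    # mass inside the footprint is attenuated by (1 - g), then renormalised.
    b = b * np.where(footprint, 1.0 - g, 1.0)
    b = b / b.sum()

    # Single-step reward used by the optimal control model: r = -ln(1 - p_u).
    reward = -np.log(1.0 - p_u)
    return b, reward
```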
5. MARL-Based Solving Method
While the minimum principle and dynamic programming are effective in solving various optimal control problems in engineering, the proposed model is challenging for them because it combines continuous and discrete variables and includes nonlinear functions. Therefore, this study utilizes deep reinforcement learning to solve the optimal control model above. This section discusses the MARL-based search optimization technique in three subsections. First, we introduce the MARL structure, which outlines the fundamental elements of MARL. Next, we review the basics of PPO (Proximal Policy Optimization) [
35], which is the single agent version of MAPPO. Finally, we provide the MARL-based search strategy and a comprehensive overview of the algorithm’s entire operation.
5.1. MARL Structure
MARL refers to algorithms that learn to select the best actions from observations through an appropriate reward design. When multiple agents share a common global reward, they gradually adjust their actions to collaborate in order to maximize that reward. In the usual MARL setting, agents interact with the environment to learn a good strategy; in this problem, however, because the opponent's strategy is unknown, the agents do not interact directly with the environment but with the mathematical model established above. At time step $t$, each agent $i$ takes an action based on its observation and receives a single-step reward $r_t$ from the model. During the interaction with the model, the agent leverages the reward information to update its policy in order to obtain a higher cumulative reward. The key components of the multi-agent reinforcement learning (MARL) system are as follows:
5.1.1. Agent
Every UAV is identified as an agent inside the search system. The goal of the drone is to find a path that minimizes J.
5.1.2. State
The state variable x represents the global state, which contains the drones' location information, heading-angle information, and the target probability map.
5.1.3. Observation
In order to distinguish each drone during the planning of drone actions, the observation not only includes state information but also includes the ID of each drone.
5.1.4. Action
Each agent $i$ selects the best course of action $a_i$ according to its current observation. In the search environment, UAV $u$ selects a steering angle as its action, chosen from a discrete set of turning angles within the admissible range $[-\theta_{\max}, \theta_{\max}]$.
5.1.5. Reward
After the UAVs perform their actions, the target probability map is updated accordingly, and the UAVs receive the reward r from the model. According to the established optimal control model, the reward received at each step is the negative of the single-step cost:

$$r_t = -\sum_{u \in U} \ln\left(1 - p_u(t)\right)$$
The multi-agent search framework incorporating optimal control models can be represented by
Figure 3.
5.2. Basic PPO
The PPO algorithm uses an actor–critic design. Specifically, it uses a critic network to estimate the expected cumulative reward an agent can earn from its current state and uses actor networks to select appropriate actions for each agent. By using the output of the critic network as a baseline for the reward, the algorithm shifts the returns encountered in real-world applications into a range centered around zero, which can then be further normalized. Through this mapping, the algorithm can adapt to scenarios whose rewards differ by orders of magnitude. The PPO algorithm improves the performance of agents by updating the policy through the following loss function:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$
Here, $\pi_\theta(a_t \mid s_t)$ represents the probability of selecting action $a_t$ in state $s_t$ under the policy parameterized by $\theta$, $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ denotes the probability under the old policy before the update, and $\hat{A}_t$ is the advantage estimate.
The use of the clip function ensures that the ratio of the new policy probability to the old policy probability does not deviate from 1 by more than $\epsilon$, ensuring gentle updates. This clipping mechanism acts as a regularization technique, encouraging the new policy to stay close to the old policy and mitigating the risk of overly large updates that could degrade performance.
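For concreteness, a short PyTorch sketch of this clipped surrogate from [35] is given below; it assumes the per-sample log-probabilities and advantage estimates have already been computed, and all names are illustrative rather than taken from the paper's code.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO, returned as a loss to minimise.

    log_probs_new : log pi_theta(a_t | s_t) under the current policy
    log_probs_old : log pi_theta_old(a_t | s_t), detached from the graph
    advantages    : advantage estimates A_t (e.g., from GAE)
    """
    ratio = torch.exp(log_probs_new - log_probs_old)           # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # negate: maximise the objective
```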
The PPO algorithm iteratively performs the following steps:
Collect data from the current policy by interacting with the environment.
Compute advantage estimates based on the collected data.
Optimize the PPO objective with respect to $\theta$ for a fixed number of epochs, using stochastic gradient ascent.
Update the policy by setting $\theta_{\mathrm{old}}$ to the optimized $\theta$.
5.3. OC-MAPPO Target Searching Method
The MAPPO method is the multi-agent version of the PPO method. In the MAPPO approach, all agents share the same rewards and utilize a centralized value function together with local policy functions. In the original MAPPO method, the loss function is calculated in the same way as in the PPO method. However, in this paper, we make some modifications to the calculation of the loss function. In the original formula, the denominator and numerator of the ratio $r_t(\theta)$ represent the probability of taking action $a_t$ in state $s_t$ before and after the network update, respectively. In the multi-agent version, however, the single action $a_t$ corresponds to the joint action $(a_t^1, \dots, a_t^n)$ taken under the observation tuple $(o_t^1, \dots, o_t^n)$, where n represents the number of agents. Therefore, we represent the probability of taking the action tuple under the observation tuple as the product of the probabilities of each agent taking its corresponding action.
The revised loss function is computed as

$$L(\theta) = \mathbb{E}_t\!\left[\min\!\left(\tilde{r}_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(\tilde{r}_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad \tilde{r}_t(\theta) = \prod_{i=1}^{n} \frac{\pi_\theta(a_t^i \mid o_t^i)}{\pi_{\theta_{\mathrm{old}}}(a_t^i \mid o_t^i)}.$$

Here, $\pi_\theta(a_t^i \mid o_t^i)$ represents the probability that agent $i$ takes action $a_t^i$ when its observation is $o_t^i$ under the current policy $\pi_\theta$, while $\pi_{\theta_{\mathrm{old}}}(a_t^i \mid o_t^i)$ denotes the probability under the old policy $\pi_{\theta_{\mathrm{old}}}$. The advantage function $\hat{A}_t$ estimates the advantage of taking the joint action in the current state compared to the expected return under the old policy.
The min and clip operations are used to limit the update step size and maintain stability during training. This loss function is used to update the actor network, while the critic network is updated using the mean square error.
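The joint ratio described above, i.e. the product of the per-agent probability ratios, is conveniently computed as the exponential of the summed per-agent log-probability differences. The sketch below follows this idea; the tensor shapes and names are assumptions, not the paper's code.

```python
import torch

def mappo_joint_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped loss with the joint ratio prod_i pi(a^i|o^i) / pi_old(a^i|o^i).

    log_probs_new, log_probs_old : tensors of shape (batch, n_agents)
    advantages                   : shared advantage estimates, shape (batch,)
    """
    # Product of per-agent ratios == exp of the sum of per-agent log-ratio terms.
    joint_ratio = torch.exp((log_probs_new - log_probs_old).sum(dim=-1))
    unclipped = joint_ratio * advantages
    clipped = torch.clamp(joint_ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```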
Next, we introduce some additional design choices for the actor network (used for fitting the policy function) and the critic network (used for fitting the value function) tailored to the problem studied. Since the input contains the two-dimensional target probability map, a convolutional neural network is selected for data processing. At the same time, taking advantage of the parallel multi-channel processing capability of modern hardware, the incoming UAV position information is integrated into the two-dimensional data as additional channels for unified processing. To distinguish between different drones, the ID of each drone is additionally included when inputting the information for that drone's action selection.
For the actor network, the input data are first passed through the convolutional neural network, flattened to one dimension, and then passed through three fully connected layers; the final layer uses a softmax activation function to map the features to a probability vector whose length equals the number of available actions. The critic network processes the input data similarly, but its output head, in combination with the tanh activation function, maps the features to a single V value. The initial parameters of all networks are orthogonal matrices. The main architecture of the actor network and the critic network is shown in
Figure 4.
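A rough PyTorch sketch of this structure is given below. Only the overall layout is taken from the description above (convolutional front end, three fully connected layers, softmax action head, tanh-based value head, orthogonal initialization); the channel counts, layer widths, and map size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def ortho(layer, gain=1.0):
    """Orthogonal initialization, as stated for all networks in this method."""
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.zeros_(layer.bias)
    return layer

class ActorCritic(nn.Module):
    """Sketch of the actor/critic structure; sizes are illustrative, not the paper's."""
    def __init__(self, in_channels, n_actions, map_size=50):
        super().__init__()
        self.backbone = nn.Sequential(
            ortho(nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)), nn.ReLU(),
            ortho(nn.Conv2d(16, 32, kernel_size=3, padding=1)), nn.ReLU(),
            nn.Flatten(),
        )
        hidden = 32 * map_size * map_size
        self.actor = nn.Sequential(
            ortho(nn.Linear(hidden, 256)), nn.ReLU(),
            ortho(nn.Linear(256, 256)), nn.ReLU(),
            ortho(nn.Linear(256, n_actions)), nn.Softmax(dim=-1),   # action probabilities
        )
        self.critic = nn.Sequential(
            ortho(nn.Linear(hidden, 256)), nn.Tanh(),
            ortho(nn.Linear(256, 256)), nn.Tanh(),
            ortho(nn.Linear(256, 1)),                               # single state value V
        )

    def forward(self, x):
        features = self.backbone(x)
        return self.actor(features), self.critic(features)
```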
In order to expedite the learning process of intelligent agents, we have developed a GPU parallel environment to facilitate more efficient interactions with the agents. The parallelized interaction approach exhibits the following changes compared to the non-parallelized method:
Prior to each update, the intelligent agent is only required to interact with each environment for a single round, as opposed to engaging in multiple rounds of interaction.
Each environment is configured with distinct parameters to enhance the generalization performance of the intelligent agent.
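The sketch below shows, under assumed shapes and a simple randomization scheme, what such a batched GPU environment can look like: every quantity carries a leading environment dimension, each environment receives its own initial target location, and a single call serves all 768 environments at once.

```python
import torch

class BatchedSearchEnv:
    """Minimal sketch of a GPU-vectorised environment. All tensors have a leading
    batch dimension, so one call resets or steps every environment simultaneously.
    The shapes and the randomisation scheme are illustrative assumptions."""
    def __init__(self, n_envs=768, map_size=50, device="cuda"):
        self.device = device
        self.b = torch.zeros(n_envs, map_size, map_size, device=device)  # probability maps
        # Distinct parameters per environment: a different initial target location each.
        self.target_xy = torch.randint(0, map_size, (n_envs, 2), device=device)

    def reset(self):
        self.b.zero_()
        idx = torch.arange(self.b.shape[0], device=self.device)
        # Place all prior probability mass on each environment's own start cell.
        self.b[idx, self.target_xy[:, 0], self.target_xy[:, 1]] = 1.0
        return self.b.clone()
```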
The overview of algorithms is depicted in
Figure 5:
The UAVs work together to investigate their surroundings by changing direction at regular intervals, and the interaction data are retained for later training. The actor and critic networks are updated using these historical samples. The agents then continue to interact with the environment using the updated actor network, repeating the previous stages until the training budget is reached. Finally, the drone chooses the path with the largest cumulative reward as the actual search route. The overall flow of the algorithm is illustrated in Algorithm 1.
Algorithm 1 Search Algorithm Based on MAPPO and Optimal Control

Initialize the parameters θ of the actor network using orthogonal initialization
Initialize the parameters of the critic network V using orthogonal initialization
Initialize the total number of iterations I; let T denote the time limit for the drone search operation, n the maximum number of simulation turns for the drones, and D_u the reconnaissance coverage of UAV u
for each iteration i in 1 … I do
    Set a different initial target location for each environment
    Set each environment's drone positions and target probability map
    Obtain the observations and global states from the environments
    Set buffer B = [ ]
    for each t in 1 … n do
        for each environment e do
            r = 0
            Execute average filtering on the target probability map
            for each UAV u do
                Select the steering action of u with the actor network
                Compute the detection probability p_u
                r += −ln(1 − p_u)
                Update the target probability map by Equation (2)
            B += [observations, states, actions, r, next observations, next states]
    Compute advantage estimates A via GAE [36] on B
    Compute the reward-to-go R on B and normalize
    Compute the loss of θ using B, A, and Equation (14)
    Compute the loss of V by the mean squared error (V, R)
    Update θ, V
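The GAE step referenced in Algorithm 1 can be sketched as follows; the discount and smoothing parameters shown are typical defaults and are not taken from the paper's settings.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one finite episode.

    rewards : tensor of shape (T,)
    values  : tensor of shape (T + 1,), critic values with a bootstrap value at the end
    Returns normalized advantages (T,) and reward-to-go targets (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        last = delta + gamma * lam * last
        advantages[t] = last
    returns = advantages + values[:-1]                           # reward-to-go targets
    # Normalisation, as in the algorithm above.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```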
6. Experiment and Result
This section presents the experiments conducted on the proposed work from four distinct perspectives. Firstly, it examines whether the model enhances the learning speed of the agents and the accuracy of the model. Secondly, it analyzes the effects of GPU parallelization. Subsequently, it compares the performance of various reinforcement learning algorithms on this task. Finally, it verifies the adaptability of the OC-MAPPO method under different parameter settings.
6.1. Optimal Control Model Validation
To validate the effectiveness of the proposed model, a comparison was conducted between the MAPPO method combined with the optimal control model introduced in this paper and the pure MAPPO method. In the first approach, the agents interacted with the established model, and the reward obtained was the negative value of the cost function at a single time step, i.e., $r_t = -\sum_{u \in U} \ln\left(1 - p_u(t)\right)$. In the second approach, the agents directly interacted with the environment, receiving a reward of 1 when a target was detected and a reward of 0 when no target was detected. Apart from this, all other parameters remained identical for both methods. Subsequently, 200 simulation experiments were conducted on the fully trained agent. In each experiment, the theoretical value of the agent's discovery probability was first derived from the rewards obtained by the agent; the computation formula is $P = 1 - e^{-R/10}$, where $R$ is the cumulative reward, as we amplified the reward by a factor of ten during training. Then, the agent's reconnaissance trajectory was fixed, and the target's movement route was randomly varied 1000 times to calculate the statistical probability of the agent discovering the target. Finally, the theoretical value and the statistical probability were compared to verify the accuracy of the model. The parameters of the MAPPO method are shown in
Table 1.
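Assuming the reward-to-probability relation stated above, the comparison between the theoretical and the Monte Carlo (statistical) discovery probability can be sketched as follows; the target-motion simulator is a placeholder and not part of the paper's code.

```python
import math

def theoretical_probability(total_reward, scale=10.0):
    """P = 1 - exp(-R / scale), since the reward was amplified by `scale` during training."""
    return 1.0 - math.exp(-total_reward / scale)

def statistical_probability(trajectory, simulate_target, n_runs=1000):
    """Fix the UAV trajectory, resample the target's motion, and count detections.
    `simulate_target(trajectory)` is a placeholder returning True if the target is found."""
    hits = sum(simulate_target(trajectory) for _ in range(n_runs))
    return hits / n_runs
```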
The experimental results are shown in
Figure 6 and
Figure 7. In
Figure 6, the horizontal axis represents the number of training iterations for the intelligent agent, while the vertical axis represents the likelihood of the intelligent agent discovering the target. These values were obtained by averaging the results of 100 interactions between the intelligent agent and its environment. Furthermore, because of the inherent randomness in the training process of reinforcement learning, each method was experimented with 30 times, and the corresponding error bars were calculated and are shown.
In
Figure 7, the
x axis represents the simulation index, and the
y axis indicates both the theoretical score calculated from the reward and the statistical score obtained from the simulation. From the graph, it can be observed that the two scores often take similar values. When the agent's simulation score decreases, its theoretical score also exhibits a corresponding decline. The two significant drops in the agent's scores occur because the agent is still running in exploration mode, meaning that during each run there is still a certain probability of executing suboptimal actions. In practical applications, this issue can be avoided by setting the agent to a deterministic mode.
The results shown in
Figure 6 and
Figure 7 indicate that the optimal control model proposed in this paper is effective and efficient.
6.2. GPU Parallel Model Verification
To validate the effectiveness of the established GPU parallel model, we conducted two comparison experiments. The first comparison was made between the OC-MAPPO method combined with the GPU-parallelized model and the non-GPU-parallelized model in a task scenario with a fixed target start position. In the first method, 3 × 256 GPU cores were utilized to update 768 mathematical models simultaneously, while the number of interaction rounds between MAPPO and the environment was reduced to one. In contrast, the second method used the CPU to update 10 mathematical models with 40 interaction rounds. The second comparison was made between the OC-MAPPO method with 10 environments, which can be parallelized on the CPU, and the OC-MAPPO method with 3 × 256 environments, which requires a GPU because of the limited number of CPU cores, in scenarios with varying target start positions. The purpose of the second experiment was to compare the effect of a GPU-parallel environment in a more challenging scenario. Apart from the number of parallel environments, all other parameters remained consistent with the experiments conducted in the first part.
The experimental results are shown in the
Figure 8 and
Figure 9.
Figure 8 indicates that the MAPPO method, incorporating GPU parallelization, demonstrated significant advantages after 20 training epochs and consistently maintained a higher discovery rate thereafter. We speculate that this can be attributed to the increased interaction between the agent and the environment. Additionally, from the second subplot, it is evident that the GPU parallelization method only took approximately half the time compared to the second approach. Therefore, we can conclude that the utilization of GPU parallelization significantly enhances the learning efficiency of the agent.
Figure 9 illustrates that, in a task scenario where the initial position of the target varies, employing a greater number of parallel environments enables the agent to learn more stably and efficiently. Conversely, when only 10 parallel environments are utilized, the learning process of the agent becomes highly erratic, and the learning pace is significantly diminished.
The results shown in
Figure 8 and
Figure 9 indicate that parallel GPU environments can significantly enhance the learning speed of the agent.
6.3. Comparison of Deep Reinforcement Learning Algorithms
Currently, the field of deep reinforcement learning is primarily divided into two branches: value-based methods and policy gradient-based methods. This experiment compares the advanced Dueling Double Deep Q-Learning (D3QN) [
37,
38] method from the value-based branch and the MAPPO method from the policy gradient-based branch in order to explain why the MAPPO method was chosen to solve the model in this paper. The parameters for D3QN are shown in
Table 2, while the parameters for MAPPO are the same as in the previous experiment.
Figure 10 illustrates the performance comparison between D3QN and MAPPO. In the left panel, the
y axis remains consistent with
Figure 6, while the horizontal axis has been altered to represent the training time required by both methods because of their distinct updating mechanisms. As evident from the graph, the MAPPO approach begins to surpass the D3QN method at approximately 50 s and maintains a consistent advantage thereafter. In the right panel, the horizontal axis represents training time, and the
y axis indicates the mean reward. The general trends in the right graph are essentially consistent with those in the left graph. However, in the initial stages of the left graph, the MAPPO algorithm exhibits a sudden surge in score. This discrepancy arises because the left graph represents the best performance achieved by the agents during each round of interaction with the parallel environment, whereas the right graph measures the average performance. In both panels, the MAPPO algorithm ultimately demonstrates superior performance compared to D3QN methods. Consequently, this study employs the MAPPO technique to resolve the proposed uncrewed aerial vehicle reconnaissance problem. Based on the information in
Figure 10, it can be concluded that the MAPPO method demonstrates superior performance and a more stable learning process compared to the D3QN method.
6.4. Comparison of MAPPO and Offline Planning Methods
In comparison to offline planning methods, such as genetic algorithms, employing deep reinforcement learning techniques enables the pre-training of intelligent agents capable of online action, allowing the agents to maintain their ability to act even when the target location changes. This study aims to compare the proposed method with genetic algorithms to observe the performance of solutions obtained by genetic algorithms and the intelligent agents trained using the presented approach when the target location is altered. The parameters for the GA method are shown in
Table 3, while the parameters for the MAPPO method remain the same as in the previous experiment. In both methods, the departure position of the drone is located at the lower left corner.
The experimental results are shown in
Figure 11 and
Figure 12, where the coordinate values represent the deviation of the target’s initial position from its initial position in the training scenario. The center point (0, 0) in the graph represents the scenario where the target’s starting position during testing is consistent with its position during training. The color of each point becomes increasingly red as the probability of detecting the target increases.
Figure 11 reveals that as the x-value changes from negative to positive and the y-value transitions from negative to positive, the color shifts from blue to red and then back to blue, indicating that the probability of target detection initially increases and subsequently decreases. In
Figure 12, when the target’s starting position coordinates are less than those of the training position, the UAV departs from the lower left corner, resulting in the target being closer to the UAV. Consequently, the colors in the graph become more red, signifying an increased probability of the UAV detecting the target. Only when the target’s distance is greater than the training position does the probability of the UAV detecting the target exhibit a decline.
In summary, the probability of the genetic algorithm’s solution detecting the target rapidly decreases when the target’s initial position changes. In contrast, the OC-MAPPO method, benefiting from a large number of interactions in parallel environments, maintains its probability of discovering the target when the target position changes but remains relatively close to the agent’s spawn location. This demonstrates the reinforcement learning algorithm’s greater robustness and broader applicability.