1. Introduction
Exploration of Mars presents myriad opportunities for advancement in science, technology, and culture. The endeavor to decipher the geology, climate, and potential of Mars to harbor life, past or present, has catalyzed considerable interest and investment, resulting in the deployment of various robotic systems with specialized capabilities and applications [1,2]. Owing to its potential to have harbored life, Mars has been a significant focus of astrobiological studies, positioning it as an ideal target for human exploration and potential settlement [3]. For many years, researchers have sought to address the complex challenges associated with extraplanetary missions, particularly key issues such as planetary landing techniques [4,5] and stable hovering operations [6]. Historically, the exploration of Mars commenced with rovers and unmanned ground vehicles (UGVs) [7] tasked with navigating the Martian terrain [8], collecting data, and performing scientific experiments. Nevertheless, the slow operational pace and limited sensing range of these rovers have constrained their utility [9]. For example, the Spirit rover, one of the most successful missions to date, traversed a mere 7.73 km over a six-year span [10]. This limitation underscores the need to re-engineer exploration systems to enhance efficiency and expand operational coverage [11].
Recent technological advancements have introduced unmanned aerial vehicles (UAVs) [12] as a promising solution for enhancing the efficiency and scope of Mars missions. UAVs provide a significant leap in capability over traditional rovers owing to their ability to cover vast areas quickly, access challenging terrain, and gather high-resolution data from multiple perspectives [13]. UAVs are advantageous for planetary exploration because they can move over the surface and through the atmosphere, offering better cost-effectiveness and efficiency than wheeled rovers or stationary platforms. Their applications include surface exploration, determining potential rover paths, identifying human landing sites, and hazard detection [14]. The inclusion of UAVs in Mars missions, such as the deployment of the Ingenuity helicopter alongside NASA's Perseverance rover, exemplifies this technological shift and the potential of UAVs to revolutionize planetary exploration [15].
The transition from single robotic systems to integrated multi-robot systems that combine UGVs and UAVs marks an advancement in Mars exploration strategies. Multi-robot systems are valued for their ability to enhance mission efficiency and accuracy through task division [16], collaborative area coverage, and real-time data sharing among multiple agents. By distributing tasks and coordinating efforts, these systems accelerate coverage, reduce mission duration, and introduce redundancy; if one UAV fails, others can seamlessly continue the mission, minimizing the risk of overall failure [17]. In addition, the UGV can function as a central computational hub, managing and synchronizing the entire system. In a related study [18], an integrated UAV-UGV system was developed for data collection in cluttered environments, where the UGV functions as the central controller for autonomous navigation while the UAV surveys areas inaccessible to the UGV, thus improving data collection efficiency. The benefits of employing multi-agent systems extend beyond reliability. They enable simultaneous coverage of larger areas while aggregating substantial environmental data [19], which is essential in exploration missions [20]. Furthermore, the cooperative nature of these systems allows UAVs to learn from and build on the experiences of their peers, increasing learning speed and reducing redundant exploration [21]. This cooperative behavior is important in dynamic environments like Mars, where real-time adaptability is critical for mission success. The Mars 2020 mission [11], featuring both the Mars Helicopter Scout and the Mars rover, exemplifies this shift toward coordinated multi-robot systems. By enabling complex operations such as coordinated mapping and sampling, which are difficult for a single unit to perform, multi-agent systems improve the robustness and resilience of missions, especially in unpredictable environments. Additionally, cooperative UAV-UGV planning maximizes mission efficiency by allowing UAVs to rely on UGVs as mobile charging stations and operational bases, addressing the energy constraints that are critical for long-duration Mars exploration [22]. Furthermore, UAV-UGV cooperation greatly improves navigation and mapping in GPS-denied environments like Mars, as demonstrated through SLAM-based technologies [23]. In related work, tethered UAV-UGV systems have been shown to optimize path and trajectory planning, ensuring synchronized movements across Mars's challenging terrain [24]. Similarly, UAV-UGV cooperation has been demonstrated to optimize pathfinding and data collection even in cluttered environments like Mars; such a system architecture enables automated path planning and navigation using SLAM, improving mission efficiency by enhancing both 2D and 3D data collection capabilities [25].
A key challenge in Mars exploration is the coverage path planning (CPP) problem. CPP [26] seeks to determine optimal paths that ensure full area coverage with minimal overlap and efficient time management. The goal is to cover the entire target area while addressing factors such as obstacles, limited battery life, and communication constraints. Given Mars's vast and varied landscape, optimal routes must maximize coverage while reducing time and resource usage [27]. Recent studies aim to improve CPP algorithms to address the unique challenges of the Martian environment, including large obstacles, harsh weather conditions, and potential communication loss during missions [28]. In many cases, CPP methods operate UAVs independently, leading to inefficiencies. Without proper coordination, UAVs may follow overlapping paths, miss certain areas, or even collide, which increases mission time and decreases coverage efficiency [26]. Furthermore, few CPP methods incorporate fault tolerance, a critical feature for Mars exploration given the harsh environment and the high likelihood of communication loss among UAVs. Such disruptions can jeopardize the mission and cause significant failures [29].
Recent research focuses on developing CPP methods capable of operating in model-free environments and enabling collaboration among UAVs to explore uncertain terrain in real time. These methods enhance system robustness and adaptability. Many approaches use neural networks, ant colony algorithms, and other advanced techniques for multi-agent online CPP in dynamic settings. However, most do not consider the observations and states of other UAVs, limiting their effectiveness. A CPP method based on an information map that tracks joint coverage records has been proposed, but it did not account for the entire coverage process to achieve a globally optimal solution. It is crucial that these methods function autonomously and remain fault-tolerant given the challenging conditions on Mars [30].
In recent years, reinforcement learning (RL) methods have gained prominence in autonomous decision making, including UAV path planning, especially for CPP problems in both known and unknown environments [31,32,33]. RL-based approaches are particularly suitable for Mars exploration due to the uncertain and dynamic nature of the environment. Autonomous agents must make real-time decisions with limited prior knowledge, and RL enables them to learn optimal policies through interaction with the environment [34]. Q-learning, a widely used RL algorithm for path planning and coverage optimization in discrete spaces, reliably converges to optimal solutions but suffers from the curse of dimensionality in more complex environments [12]. To address this limitation, deep neural networks (DNNs) are used to approximate Q-values in large state-action spaces, leading to the deep Q-network (DQN) approach [35].
Multi-agent systems [36,37] present new opportunities and challenges when applying RL. One major advantage of RL in multi-agent systems is that agents can learn from their interactions, adapt to dynamic environments, and improve coordination. However, multi-agent RL (MARL) also introduces complexities, such as the need to coordinate actions, divide tasks optimally, manage shared resources, and maintain adaptability in changing environments [38]. In multi-agent cooperative coverage scenarios [39], achieving real-time, fault-tolerant planning [40] remains particularly challenging. Centralized RL algorithms generally offer better communication and faster convergence by allowing agents to learn as a unified system; however, they are vulnerable to single points of failure and scalability issues, especially as the number of agents increases [41]. Decentralized RL algorithms, on the other hand, provide greater robustness against individual agent failures and allow for more flexible and scalable systems, yet they require more computational resources per agent and face challenges in achieving global coordination. An optimal approach would combine centralized exploration, which ensures comprehensive learning, with decentralized exploitation, which reduces the resource demands on each agent while retaining flexibility [42]. A popular framework that combines both approaches is centralized training with decentralized execution (CTDE), in which agents learn collectively during training but operate independently during execution [43]. Furthermore, cooperative exploration methods, where agents share a common exploration objective, help to avoid redundancy and improve overall performance; however, coordinating these efforts requires careful design to avoid conflicts [44]. Strategies such as role assignment and hierarchical coordination can optimize policies by improving agent-level interactions and cooperation in complex environments [45,46]. Regarding specific MARL methods, actor-critic approaches such as asynchronous advantage actor-critic (A3C) have shown promise by combining the strengths of value-based and policy-based techniques, enhancing both stability and exploration; however, they require significant computational resources and can be difficult to scale efficiently in multi-agent settings [47]. Similarly, the QMIX algorithm is widely used in MARL tasks because it combines individual Q-values into a joint Q-value through a mixing network while ensuring monotonicity. However, QMIX is fully centralized during training, making it susceptible to single-agent failures, which can disrupt the entire system and reduce its effectiveness in fault-tolerant scenarios. While methods like QMIX have advanced multi-agent coordination through CTDE frameworks, they face challenges in scaling to larger systems or operating under unreliable communication, conditions commonly encountered in Mars exploration. The dependence on centralized value functions during training creates bottlenecks and vulnerabilities; if the central entity fails or communication is interrupted, system performance can degrade. Additionally, the monotonic value function constraint of QMIX can restrict its ability to handle more complex tasks [48].
To address these challenges in Mars exploration, we utilize a UGV as a central computational hub, with UAVs dedicated to coverage path planning. To achieve this, we propose a novel method called SERT-DQN (similarity-based experience replay enhanced target deep Q-network), which combines the simplicity of DQN with the coordination capabilities of QMIX. In SERT-DQN, the primary Q-network of each UAV, implemented as a multi-layer perceptron (MLP), handles decentralized decision-making locally on that UAV, while the intermediate Q-network, responsible for joint Q-value calculations, is centralized and computed on the UGV. This structure facilitates cooperation within the UAV-UGV system, reduces conflicts, and enhances overall performance. By integrating centralized strategic oversight with decentralized autonomous actions, SERT-DQN optimizes both exploration and exploitation. The algorithm aligns individual training with joint training, ensuring that individual decision-making is consistent with both local and global objectives. This alignment helps maintain coordination among agents, leading to more cohesive decision-making across the system. As a result, the method improves sampling efficiency and enables multi-agent systems to benefit from shared knowledge, leading to higher success rates and faster convergence. This framework is particularly well-suited to environments characterized by uncertainty and operational challenges. To further enhance performance, SERT-DQN incorporates a similarity-based experience replay mechanism that prioritizes experiences based on state similarity, focusing on relevant data while minimizing the influence of less pertinent or misleading information. This selective experience replay promotes balanced exploration, faster convergence, and improved stability. Additionally, SERT-DQN is designed with fault tolerance in mind: the enhanced version, SERT-DQN+, ensures that UAVs maintain effective path planning and mitigate overlap during communication disruptions until the connection is restored, which is critical for mission efficiency in Mars exploration.
This paper proposes a multi-agent fault-tolerant online cooperative CPP method aimed at optimizing fault-tolerant coverage paths and minimizing task completion time for Mars exploration. The main contributions are as follows:
Enhances UAV cooperation by expanding local observations into global observations, improving decision-making and coordination.
Introduces a fault-tolerant algorithm for online coverage path optimization that handles connection losses while avoiding overlaps, uncovered areas, and collisions. The approach maintains low computational demands on each UAV, enabling fast online coverage path planning.
Improves sample efficiency and accelerates convergence by prioritizing relevant experiences, optimizing the learning process.
The structure of this paper is organized as follows: Section 2 covers the essential preliminaries; Section 3 explains the system model and problem formulation; Section 4 describes the proposed algorithm; Section 5 presents simulation results and discussion; and Section 6 concludes the paper and outlines potential future research directions.
3. System Framework
In this section, we discuss the method of autonomous CPP by UAVs and a central UGV. In this work, we introduce coverage maps for each UAV. These maps are designed not only to enable accurate modeling of the environment but also to allow each UAV to access observations collected by other UAVs. This modeling aids in better organization and coordination within the UAV-UGV system. The problem of coverage path planning for multiple UAVs is then formulated as an optimization problem, with the objective of minimizing mission completion time and reducing area overlap. To ensure effective online collaboration among the UAVs, each UAV operates autonomously while coordinating with the central UGV. This collaboration aims to gather and share environmental information, resulting in the development and expansion of environmental observations.
We configure a fleet of N UAVs, represented by the set U = {UAV_1, UAV_2, ..., UAV_N}. The target area is defined as a rectangle of given length and width.
Figure 1 illustrates the online coverage path planning process by the UAVs. The map presents a two-dimensional grid representation of the three-dimensional environment used.
In the introduced system, UAVs use aerial cameras to cover the target area. The environment is unknown to the UAVs and contains a number of obstacles. The UAVs start from their initial positions and proceed step by step, with the goal of minimizing mission completion time while achieving full coverage of the area. Each UAV's coverage path consists of a sequence of points defined over multiple moves. At the beginning of each move, each UAV makes its flight decision online at the grid point it occupies, moving toward one of its neighboring points. As shown in Figure 1, each position in the grid has four neighbors. Therefore, each UAV's coverage path consists of multiple connected grid points. Additionally, UAVs can utilize information obtained from the environment during decision-making, which results from the collaboration among multiple UAVs.
The UAVs are homogeneous and do not differ in performance. They have the same flight speed and operational space. Moreover, the camera detection radius projected onto the ground and the communication radius are identical for each UAV. The target area's environment remains constant during the mission and contains fixed, unknown obstacles. The initial position of a UAV's path can be at the edge or inside the target area, while the final position must be on the edge for easier retrieval. The camera used for ground coverage is mounted underneath the UAV. The field of view of this camera is circular and is referred to as the UAV's detection area. The detection radius r depends on the UAV's fixed flight height h and the aerial photography angle θ, and can be calculated using the following formula:
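The formula itself is not preserved in the extracted text; the standard geometric relation consistent with this description, using r, h, and θ as introduced above (these symbols are our reconstruction), is

```latex
r = h \tan\theta
```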
3.1. UAV Mathematical Model
3.1.1. Environment Modeling Using the Communication Data Matrix
For each UAV, the obtained images are modeled frame by frame. Therefore, the target area is modeled as a grid based on the camera's field of view for each UAV. To ensure accurate coverage, the grid map cells are sized according to the square inscribed in the UAV's circular detection area. Since the inscribed square covers the largest area among rectangles inscribed in a circle, we propose a grid whose cells are set as squares inscribed within the detection range, so as to fully utilize the environmental information detected by the UAV.
To achieve better communication between UAVs and the central UGV, we introduce a communication system. For this purpose, we propose a communication data matrix (CDM), which employs a value-based representation method to track coverage records and identified obstacles within the improved grid structure. The CDM is designed to allow each UAV to share its data with the central UGV and neighboring UAVs. The UGV collects all the maps, updates them, and shares the information with all UAVs. Additionally, each UAV can exchange its CDM with neighboring UAVs. This ensures that UAVs consistently move toward uncovered grid cells during the mission while avoiding obstacles and updating others on their locations via the CDM. This collaborative approach enhances the UAVs’ ability to support each other in path planning, ultimately reducing task duration.
By mapping the relationship from the 3D environment to a 2D representation, the target area is divided into an improved grid map consisting of M_x × M_y rectangular regions, where M_x and M_y are obtained by dividing the actual width and length of the target area, respectively, by the side length of the inscribed square cells. The geometric center of each grid cell, referred to as the grid point or G-point, is represented by the grid map coordinates (x, y), where x ∈ {1, ..., M_x} and y ∈ {1, ..., M_y}. This definition ensures that when a UAV is positioned above the G-point of a grid cell, that cell is fully covered. The current grid point of UAV_i and its target grid point are both obtained through a position index function that maps the UAV's location to grid map coordinates. We define g(x, y) as the grid cell centered at (x, y), so that the entire grid map can be represented as the set of all such cells, with the cells located at the edges of the grid map forming a separate subset. Each cell is assigned an environmental information value: one value if the cell has been covered, another if it has not yet been covered, and a distinct value if the cell contains an obstacle. Based on these values, the CDM perceived by UAV_i is represented as a matrix whose (x, y) entry is the environmental information value of grid cell g(x, y) as perceived by UAV_i; the entry at the UAV's current grid location is indexed by its current position. The process of merging and updating environmental information across the UAVs and the UGV is then defined by an element-wise update of the CDM entries.
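The exact update rule appears as an equation in the original; the following is a minimal sketch of one plausible element-wise merge, assuming covered cells are marked 1, uncovered cells 0, and obstacle cells -1 (these specific values and the function names are our assumptions).

```python
# Sketch of a CDM merge: obstacle marks take priority, coverage marks are kept
# once set. Cell values (1 covered, 0 uncovered, -1 obstacle) are assumptions.
import numpy as np

OBSTACLE, UNCOVERED, COVERED = -1, 0, 1

def merge_cdm(own_cdm: np.ndarray, received_cdm: np.ndarray) -> np.ndarray:
    """Fuse a UAV's CDM with one received from the UGV or a neighboring UAV."""
    merged = np.maximum(own_cdm, received_cdm)           # keep coverage records
    obstacle_mask = (own_cdm == OBSTACLE) | (received_cdm == OBSTACLE)
    merged[obstacle_mask] = OBSTACLE                      # obstacles always persist
    return merged
```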
3.1.2. Coverage Time
In a defined system, the overall task completion time is determined by the UAV that takes the longest to complete its assigned mission. UAVs commence their tasks from distinct starting positions and move iteratively to cover their designated target areas. Each UAV is responsible for a specific segment of the area, referred to as a subtask, and the union of these subtasks ensures complete coverage of the target area. Due to differences in initial positions, obstacle distribution, and other environmental factors, the workload for each UAV may vary. Consequently, the time taken by the last UAV to complete its subtask dictates the total task completion time for the system.
According to the updated CDM, the time required by UAV_i to complete its subtask is obtained by summing, over all of its movements, the Euclidean distance between the starting grid point and the target grid point of each movement, and dividing by the UAV's flight speed. Here, the total number of steps required by UAV_i to complete its subtask corresponds to the final step of its episode, the set of target positions defines each movement, and all UAVs fly at the same speed. The overall task completion time of the fleet is equal to the maximum time taken by any single UAV to cover its assigned path.
3.1.3. Target Area Coverage
To quantify the coverage of the target area achieved by the UAVs, we introduce the concept of the effective coverage area of each UAV. This area is defined as the difference between the total coverage area and the overlap area; that is, the effective coverage area is the coverage area minus the overlap area. In this formulation, the coverage and overlap areas are obtained from the sets of grid cells that the UAV covers or covers redundantly: a counting function returns the number of elements in each set, and multiplying this count by the area of a single grid cell yields the corresponding area.
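With |·| denoting set cardinality and S_cell the area of one grid cell (symbols ours), the effective coverage area described above can be written as

```latex
A_i^{\mathrm{eff}} = A_i^{\mathrm{cov}} - A_i^{\mathrm{over}},
\qquad
A_i^{\mathrm{cov}} = \lvert \mathcal{G}_i^{\mathrm{cov}}\rvert \, S_{\mathrm{cell}},
\qquad
A_i^{\mathrm{over}} = \lvert \mathcal{G}_i^{\mathrm{over}}\rvert \, S_{\mathrm{cell}} .
```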
3.1.4. Optimization-Based Problem Modeling
Maximizing coverage efficiency is equivalent to minimizing task completion time and reducing overlap in coverage. To achieve this goal, the UAVs must collaboratively optimize their paths, including the optimization of both the number of movements and the target positions for each UAV. Consequently, the multi-UAV CPP optimization problem, based on the defined system framework, is formulated as follows:
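The full formulation is given by the original equation set, which is not preserved here; a schematic form consistent with the description, in which the decision variables are the number of movements and the target points of each UAV and the constraints are referenced by number (how the overlap-reduction objective is weighted against completion time is our assumption), is

```latex
\min_{\{K_i\},\,\{p_i^{k,\mathrm{target}}\}} \; \max_{i\in\{1,\dots,N\}} T_i
\quad \text{subject to constraints (1)--(6)}.
```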
These constraints ensure that (1) no UAV operates beyond its endurance limit; (2) the current and target positions of each UAV remain within the designated grid map of the target area; (3) collisions between UAVs are prevented; (4) UAVs effectively avoid grid cells containing obstacles; (5) each UAV terminates its subtask at a boundary grid point; and (6) the fleet of UAVs collectively achieves complete and discrete coverage of the target area. It should be noted that the effect of Mars’s curvature on the calculation is not taken into account.
3.2. Mathematical Modeling of Proposed Method
In this section, we use a Markov game to formulate the CPP problem for the multi-UAV-UGV system. In this model, the state space, action space, and reward function of the system are defined. Although the Markov game framework typically includes a transition function, in this work the transition function is not explicitly modeled because model-free RL is used.
3.2.1. State Space
In many reinforcement learning methods for CPP, each UAV defines its state space based solely on its own perceived environmental information, disregarding information from other agents. This limitation can negatively impact the task completion time and coverage efficiency of the multi-agent system.
To avoid these problems, each UAV needs to cooperate with others to gain a comprehensive observation of the environment, which includes the latest records of covered areas and the positions of detected obstacles. Additionally, each UAV must consider not only its own target location but also the target locations of other UAVs to avoid collisions. Ultimately, UAVs can make flight decisions based on these factors to reduce task completion time and minimize overlap, thereby improving coverage efficiency.
At each step, each UAV collaborates with the other UAVs to obtain their coverage maps and target positions. It then updates and merges its own map based on the equations above. After that, the UAV makes its decision based on its local state space, defined for each UAV at movement k as the tuple consisting of the UAV's own position, its communication data matrix, and the respective target positions of the other UAVs. This state is individual to each UAV.
The global state, which captures the joint state of the system, consists of the positions of all UAVs, their respective target positions, and the CDM.
3.2.2. Action Space
The action space of each UAV is discrete, while the joint action space A of all agents consists of the actions taken by each UAV at step k. The action set is identical for all UAVs. When a UAV is at a given grid point, its available target grid points are the four adjacent cells (up, down, left, and right). If the UAV is positioned at the edge or a corner of the target area, actions that would lead it outside the boundary are excluded. The joint action space A of all UAVs is the Cartesian product of the N individual action spaces, where N is the number of UAVs.
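A minimal sketch of the four-neighbor action set with boundary masking follows; the grid indexing convention and function names are our own.

```python
# Four-neighbor action set with boundary masking; 1-based indexing is assumed.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def valid_actions(x: int, y: int, m_x: int, m_y: int) -> list[str]:
    """Return the moves that keep a UAV at grid point (x, y) inside an
    m_x-by-m_y grid map."""
    valid = []
    for name, (dx, dy) in ACTIONS.items():
        nx, ny = x + dx, y + dy
        if 1 <= nx <= m_x and 1 <= ny <= m_y:
            valid.append(name)
    return valid
```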
3.2.3. Reward Function
The reward function should be designed to incentivize each UAV to fully cover all obstacle-free grid points with the minimum number of movements. On the one hand, actions that lead to exploring unknown areas should be rewarded to ensure complete coverage of the target area without missing any grids. On the other hand, actions that lead to overlapping or collisions should be penalized, as they may result in redundant movements or task failure. The reward function for each UAV is defined as follows:
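The paper's reward values appear in the omitted equation; the sketch below only illustrates the stated structure (reward new coverage, penalize overlap and collisions), and the numeric constants are placeholders of our own.

```python
# Illustrative reward structure only; the numeric constants are placeholders.
def individual_reward(cell_newly_covered: bool, cell_overlap: bool,
                      collision_or_obstacle: bool) -> float:
    if collision_or_obstacle:
        return -10.0   # heavy penalty: may cause redundant moves or task failure
    if cell_newly_covered:
        return 1.0     # encourage exploring uncovered grid cells
    if cell_overlap:
        return -1.0    # discourage overlapping, redundant movements
    return 0.0
```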
Thus, the reward function is defined to encourage actions leading to exploration and full area coverage while penalizing actions that cause overlap or collisions, with the goal of minimizing task completion time.
The joint reward function for all UAVs at step k is the sum of the individual rewards of the UAVs.
4. Proposed Algorithm for CPP Problem
This section presents our proposed algorithm, the similarity experience replay enhanced target deep Q-network (SERT-DQN). First, we outline the key details of the SERT-DQN algorithm, followed by a discussion of the similarity-based experience replay mechanism. Next, we introduce an enhanced version, SERT-DQN+, designed to handle fault-tolerant scenarios. Lastly, we model the environment using a Markov game framework to further support the algorithm’s applicability.
4.1. Overview of the SERT-DQN
The SERT-DQN algorithm integrates reinforcement learning principles with our proposed coordination strategies, utilizing two Q-networks to optimize UAV path planning. In this section, we delve into the details of the SERT-DQN algorithm, explaining how it works with two Q-networks and takes advantage of both global and individual learning. The SERT-DQN algorithm enables each UAV to operate autonomously based on local information while synchronizing its actions through a global evaluation framework. This global evaluation is achieved by augmenting the DQN framework with an intermediate network shared by all UAVs, in addition to each UAV's primary network, which facilitates coordination across the fleet. This combined approach is designed to enhance coverage efficiency and reduce task completion time by leveraging both local and global data. The algorithm is centered around a UGV, which serves as the central control system. Unlike traditional CTDE approaches, where centralized coordination is limited to training, our method extends the UGV's role into the execution phase.
SERT-DQN consists of two types of Q-networks. One is the intermediate Q-network, which is responsible for calculating the joint value using a joint shared experience replay. The other is the primary Q-network of each individual agent, which uses that agent's individual experience replay together with the shared similarity-based experience replay. Each primary Q-network's loss function is updated based on both the individual and the joint Q-network parameters, ensuring that each UAV's decision is grounded in both individual and global states. Action selection is based on each primary Q-value, allowing each UAV to decide individually while accounting for both global and individual states. This configuration makes SERT-DQN scalable and computationally efficient, allowing it to support larger UAV fleets and more complex environments without performance loss, making it a robust solution for multi-agent reinforcement learning in UAV networks. The schematic diagram of the method is shown in Figure 2.
Knowledge transfer in MARL aims to enhance learning efficiency by sharing policies or experiences among agents or across tasks [53,54], facilitating inter-agent knowledge transfer and accelerating convergence. While these approaches focus on transferring knowledge to new tasks or between agents, our SERT-DQN algorithm differs by enhancing coordination within the same task through a similarity-based experience replay mechanism. Rather than transferring learned policies, we prioritize relevant experiences based on state similarity, improving sample efficiency and cooperation among agents without the overhead of transferring knowledge across different domains. This distinction is crucial for Mars exploration, where agents operate collaboratively in the same environment and must adapt to dynamic conditions in real time.
The summary of the SERT-DQN elements is as follows:
Individual primary Q-networks: These networks guide local decision-making for each UAV based on immediate environmental data.
Mutual intermediate Q-network: This network facilitates collaboration by aggregating and processing data across multiple UAVs.
Target network for the primary Q-network: This target network provides stable target Q-values for each primary network, computed from a combination of the primary Q-network and the intermediate Q-network.
Target network for intermediate Q-network: This ensures stable training by periodically updating towards the intermediate network’s weights.
4.1.1. Intermediate Q-Network
The intermediate Q-network in the SERT-DQN framework plays a pivotal role as it calculates joint Q-values that account for dependencies among the individual Q-values of each UAV. It updates at a constant interval shorter than the primary Q-network, offering a global perspective that enhances coordination and aligns individual UAV actions with overall mission objectives. This alignment reduces conflicts and improves system performance. Additionally, the less-frequent updates balance timely global feedback and computational efficiency, enabling the system to scale effectively while ensuring coordinated operations across the UAV fleet.
The calculation of the intermediate Q-value is based on QMIX. QMIX is a multi-agent reinforcement learning method that ensures the shared Q-value is a monotonic function of the individual Q-values of each agent. This is achieved through a mixing network that combines the individual Q-values using a function constrained to be monotonic. The constraint ensures that increasing the Q-value of any agent cannot decrease the shared Q-value. In other words, if one UAV improves its performance and increases its Q-value, this change always increases or maintains the shared Q-value and never decreases it. This is important because it preserves cooperation between UAVs and prevents competition that would undermine overall performance. The monotonicity constraint requires the partial derivative of the joint Q-value with respect to each individual Q-value to be non-negative. The primary Q-values of the UAVs are combined by a mixing function to compute the intermediate Q-value, and the weights of this function are generated dynamically by a hypernetwork conditioned on the global state S. The global state is fed into a fully connected layer that extracts relevant features. In our proposed method, the hypernetwork includes two fully connected hidden layers, which process the state information and transform it into a more abstract representation; ReLU activation functions introduce non-linearity so that the network can learn complex patterns. The output layer of the hypernetwork generates the weights used by the mixing function, and ReLU activations are applied to this output to ensure the weights are non-negative. Finally, the mixing network combines the weighted individual Q-values into the intermediate Q-value; it consists of a single fully connected layer that sums the contributions from each agent using the weights provided by the hypernetwork. The intermediate Q-value is thus parameterized by the parameters of each primary Q-network, of the mixing network, and of the hypernetwork. In this architecture, the first and second mixing layers use non-negative weight matrices together with bias vectors, which refine the output and preserve monotonicity while optimizing the learning process.
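For concreteness, a QMIX-style mixing network with state-conditioned hypernetworks can be sketched as follows; the layer sizes, the ELU activation inside the mixing network, and all class and variable names are our assumptions rather than the paper's exact architecture.

```python
# Hedged sketch of a QMIX-style monotonic mixing network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks generate mixing weights/biases from the global state.
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_agents * embed_dim))
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        batch = agent_qs.size(0)
        # ReLU keeps the mixing weights non-negative, enforcing monotonicity.
        w1 = F.relu(self.hyper_w1(state)).view(batch, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(batch, 1, self.n_agents), w1) + b1)
        w2 = F.relu(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_int = torch.bmm(hidden, w2) + b2   # joint (intermediate) Q-value
        return q_int.view(batch, 1)
```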
4.1.2. Intermediate Target Network
To ensure stability of the intermediate Q-network, we use a target network called the intermediate target network. Its parameters are periodically updated to provide stable learning targets; this periodic update mechanism is crucial for maintaining stability in the learning process and ensuring coherent convergence of the Q-values. The loss function for updating the parameters of the intermediate Q-network minimizes the squared error between the predicted intermediate Q-values and the target values, where the target value is given by the Bellman equation, combining the joint reward with the discounted intermediate target Q-value of the next global state.
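With θ_int and θ_int^- denoting the intermediate and intermediate-target parameters, r^k the joint reward, and γ the discount factor (symbol names are ours), the periodic copy, loss, and Bellman target described above take the standard form

```latex
\theta_{\mathrm{int}}^{-} \leftarrow \theta_{\mathrm{int}} \ \ \text{every } C \text{ steps},
\qquad
L(\theta_{\mathrm{int}}) = \mathbb{E}\Big[\big(y^{k} - Q_{\mathrm{int}}(S^{k}, \mathbf{a}^{k}; \theta_{\mathrm{int}})\big)^{2}\Big],
\qquad
y^{k} = r^{k} + \gamma \max_{\mathbf{a}'} Q_{\mathrm{int}}\big(S^{k+1}, \mathbf{a}'; \theta_{\mathrm{int}}^{-}\big).
```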
4.1.3. Primary Q-Network
Each UAV has its own primary Q-network responsible for local decision-making. The primary Q-network of each UAV is implemented as a multi-layer perceptron (MLP) [55]. This architecture ensures that each UAV can process its local state information and make informed decisions, while the simplicity of the MLP reduces the computational load on each UAV. The input to the primary Q-network consists of the local state vector of the UAV at movement k and a corresponding action. The network has two hidden layers: the first applies a ReLU activation to the weighted input, and the second processes the resulting output with another ReLU activation. The final output layer computes the Q-value of the given state-action pair, helping the UAV make informed decisions. Further details about the specific hyperparameters and network architecture are provided in the experiment section of the paper.
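A minimal sketch of such a per-UAV MLP follows; the hidden sizes are assumptions, and we output one Q-value per action, a common, equivalent alternative to feeding the action in as an input.

```python
# Hedged sketch of a per-UAV primary Q-network (hidden sizes are assumptions).
import torch
import torch.nn as nn

class PrimaryQNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        # Two hidden layers with ReLU, as described in the text.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # Q-values for all actions in the given local state
```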
4.1.4. Q-Primary Target Network
In this algorithm, the Q-primary target network is responsible for evaluating and stabilizing the Q-values. Its parameters are derived from both the primary Q-network parameters and the intermediate network parameters, helping to balance global and local information. Unlike the intermediate target network, which is updated based solely on collective behavior, this network serves as a balance point between the individual optimization of the primary Q-networks and the collective insight of the intermediate target network.
The Q-primary target network in the SERT-DQN framework ensures cohesive fleet behavior by balancing local and global information, preventing UAVs from acting solely in self-interest and reducing conflicts such as overlapping coverage or redundant paths. This network also enhances adaptability, allowing UAVs to adjust their strategies based on dynamic environments and shared experiences. The integration of individual and collective insights stabilizes the learning process by minimizing policy fluctuations. Furthermore, by aligning individual actions with group objectives, it improves overall mission efficiency, ensuring optimal resource allocation and comprehensive coverage. The Q target value at movement k is calculated from a combination of the Q-values of the primary target network and the intermediate target network, evaluated at the states of the UAVs after the next action. Each agent's primary Q-network then updates its parameters by minimizing the mean squared error between its predicted Q-values and the target Q-values provided by this target network.
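The exact combination rule appears in the omitted equations; one schematic form consistent with this description, in which a blending coefficient λ (our assumption) weighs the primary target and intermediate target values, is

```latex
y_i^{k} = r_i^{k} + \gamma\Big[\lambda \max_{a'} Q_i^{-}\big(s_i^{k+1}, a'\big)
        + (1-\lambda)\, Q_{\mathrm{int}}^{-}\big(S^{k+1}, \mathbf{a}^{k+1}\big)\Big],
\qquad
L(\theta_i) = \mathbb{E}\Big[\big(y_i^{k} - Q_i(s_i^{k}, a_i^{k}; \theta_i)\big)^{2}\Big].
```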
4.1.5. Updating Intermediate and Primary Network Parameters
Stochastic gradient descent (SGD) [56] adjusts the parameters of each primary Q-network and of the intermediate Q-network to minimize their respective loss functions. The learning rates of the primary and intermediate networks determine the extent of the parameter adjustments at each update step; an appropriate learning rate helps achieve faster and more stable convergence of the algorithm. SGD optimizes the parameters by computing the gradient of the loss function with respect to the parameters and taking small steps in the direction that reduces the loss. This iterative process continues until the loss function reaches its minimum possible value and the parameters reach their optimal values.
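In symbols (the learning-rate names α and β are ours), these are the standard gradient steps

```latex
\theta_i \leftarrow \theta_i - \alpha\, \nabla_{\theta_i} L(\theta_i),
\qquad
\theta_{\mathrm{int}} \leftarrow \theta_{\mathrm{int}} - \beta\, \nabla_{\theta_{\mathrm{int}}} L(\theta_{\mathrm{int}}).
```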
4.2. Action Selection
During action selection, each UAV chooses its action using an epsilon-greedy policy [57] applied to the Q-value function computed by its primary Q-network. This ensures that action selection is based on local evaluations that are updated in line with the joint training. In the epsilon-greedy method, a UAV selects a random action with probability ε and the action with the highest Q-value with probability 1 − ε. In the early stages of learning, the probability of selecting random actions is higher, allowing the UAV to better explore the environment and gather more experience. Over time, as learning improves, ε decreases and the UAVs increasingly select the best action according to their Q-values.
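A minimal sketch of this selection rule with a decaying ε follows; the decay schedule and bounds are our assumptions.

```python
# Epsilon-greedy action selection with decay (schedule values are assumptions).
import numpy as np

def select_action(q_values, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decay_epsilon(epsilon, decay=0.995, eps_min=0.05):
    """Gradually shift from exploration toward exploitation."""
    return max(eps_min, epsilon * decay)
```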
4.3. Experience Mechanism for the Q-Intermediate
The experience of the multi-UAV system at a given time step is captured in a shared tuple consisting of the global state, the joint action, the joint reward, and the next global state, where all elements are considered jointly across the UAVs. The global state represents the aggregated states of all UAVs at that time step, reflecting the overall condition of the system. The joint action consists of the combined decisions made by all UAVs from their respective action spaces. Upon executing these actions, the system receives a collective reward, which quantifies the immediate outcome of the actions and guides the learning process by highlighting favorable or unfavorable behaviors across the entire system. The tuple concludes with the next global state, which captures the updated state of the system after the joint actions have been executed. These shared experiences are stored in a joint replay buffer that is accessible to all UAVs. During the training phase, data are randomly sampled from this joint buffer in mini-batches, enabling the intermediate Q-network to learn in a more stable and coordinated manner. All these shared elements, namely the global state, joint action, and joint reward, are equivalent to the corresponding sets of individual parameters of the UAVs.
4.4. Similarity Experienced Replay for the Q-Primary
For improved exploration, we propose a shared experience replay mechanism that allows each UAV to sample not only from its own experiences but also from the experiences of other UAVs. However, despite its advantages, a shared experience replay buffer can introduce non-stationarity issues: agents encounter states and actions they have not personally experienced, which makes it difficult to learn appropriate policies. To address this challenge, we propose a cosine similarity-based state sampling mechanism that enhances the relevance of sampled experiences for the primary Q-networks, thereby stabilizing the learning process and promoting effective exploration. This method prioritizes experiences whose states are most relevant to the current state of each UAV, which guarantees more stable and effective learning. For each UAV, the cosine similarity between its current state and the states stored in the shared experience replay buffer is measured; cosine similarity is the inner product of the two state vectors divided by the product of their norms.
To reduce computational demand, a subset of experiences (100 samples) is randomly selected from the shared experience replay buffer before applying the cosine similarity measurement. Within this subset, experiences are tiered by their similarity scores into three categories: a similar tier (top 10%, most similar); a less-similar tier (next 20%, moderately similar); and a non-similar tier (bottom 70%, least similar).
The experience selection process is based on a tiered sampling mechanism that prioritizes experiences from the most relevant tier while still incorporating less-similar and non-similar experiences to encourage exploration. The similar tier, which comprises the experiences most relevant to the UAV's current state, is sampled with the highest probability (80% of the time). To form a mini-batch for training, we determine the number of experiences to sample from each tier based on the desired mini-batch size. Each experience sampled from the shared buffer is compared against a corresponding experience sampled from the UAV's individual replay buffer. The comparison is based on the temporal difference (TD) error, that is, the difference between the target Q-value and the predicted Q-value; the experience with the highest TD error among the candidates is selected for training.
We compute the TD error for both the shared experience and the local experience. The experience with the higher TD error is selected for inclusion in the final training mini-batch. For the less-similar tier, which contains experiences moderately relevant to the current state, sampling occurs less frequently (15% of the time). For each selected experience from this tier, a comparison is made against a sample from the similar tier and a sample from each agent’s individual replay buffer. The experience with the highest TD error is selected for training. This approach allows the model to explore slightly different experiences while still maintaining relevance, contributing to more comprehensive learning.
The non-similar tier is sampled occasionally (5% of the time) to introduce more diverse and exploratory experiences into the learning process. For each selected experience from this tier, the same comparison process as with the less-similar tier is used. However, if the TD error of the non-similar experience exceeds a predefined threshold, indicating that it is too different and potentially harmful to the learning process, it is discarded. Otherwise, the experience with the highest TD error is chosen for training. This threshold ensures that while exploration is encouraged, experiences that are too far from the UAV’s current state do not destabilize the learning process. The pseudo code of the proposed method is presented as Algorithm 1.
Algorithm 1: SERT-DQN Algorithm
Initialize parameters
For each episode do:
    Initialize the environment and obtain the initial local states for each UAV and the global state
    For each time step do:
        For each UAV do:
            With probability ε: select a random action from the action space
            With probability 1 − ε: select the greedy action from the primary Q-network
            Collect the joint action
            Execute the joint action in the environment
            Observe the next local states and individual rewards for each UAV
            Update the CDM and the global state
            Compute the joint reward
            Store the individual experience in the individual replay buffer
        End For
        Store the shared experience in the joint replay buffer
        Training updates:
        For each UAV do:
            Sample a mini-batch using similarity-based experience replay
            For each experience in the mini-batch do:
                Compute the target and the loss
                Update the primary Q-network parameters
            End For
        End For
        Sample a mini-batch from the joint replay buffer
        For each experience in the mini-batch do:
            Compute the target and the loss
            Update the intermediate Q-network parameters
        End For
        Update target networks periodically:
        If the update interval is reached then:
            For each UAV do:
                Update the primary target network
            End For
            Update the intermediate target network
        End If
        Update states:
        For each UAV do:
            Set the current state to the next state
        End For
        Adjust the exploration rate ε
    End For (time steps)
End For (episodes)
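Complementing Algorithm 1, a sketch of the tiered similarity sampling routine used in the training updates is given below; the tier fractions and sampling probabilities follow the text, while the data layout and helper names are our assumptions. The TD-error comparison against the UAV's individual buffer would then be applied to the experiences returned by this routine.

```python
# Hedged sketch of similarity-based tiered sampling from the shared buffer.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def tiered_sample(shared_buffer, current_state, rng=np.random.default_rng(),
                  subset_size=100):
    # Pre-select a random subset of shared experiences to limit computation.
    idx = rng.choice(len(shared_buffer),
                     size=min(subset_size, len(shared_buffer)), replace=False)
    subset = [shared_buffer[i] for i in idx]
    # Rank stored states by similarity to the UAV's current state.
    subset.sort(key=lambda e: cosine_similarity(e["state"], current_state),
                reverse=True)
    n = len(subset)
    tiers = {
        "similar": subset[: max(1, n // 10)],            # top 10%
        "less_similar": subset[n // 10: int(0.3 * n)],   # next 20%
        "non_similar": subset[int(0.3 * n):],            # bottom 70%
    }
    # Sample a tier with the probabilities described in the text (80/15/5).
    tier = rng.choice(["similar", "less_similar", "non_similar"],
                      p=[0.80, 0.15, 0.05])
    pool = tiers[tier] or tiers["similar"]
    return pool[rng.integers(len(pool))], tier
```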
4.5. Fault-Tolerant Version of the SERT-DQN
This section presents a fault-tolerant extension of the SERT-DQN algorithm, SERT-DQN+, whose primary objective is to enhance the robustness of UAV operations when one or more UAVs lose connection to the central UGV. The extension introduces mechanisms to handle such communication losses during the execution phase of the mission.
4.5.1. Fault-Tolerant Q-Intermediate Approximation for Lost UAV
When a UAV loses connection to the central UGV, it needs to approximate the intermediate Q-values in order to continue making informed decisions. The proposed method achieves this through a combination of the last QMIX values received from the central UGV and an averaging mechanism involving the lost UAV's own primary Q-values. While connected, each UAV updates its primary Q-network following the standard SERT-DQN approach, and the intermediate Q-values are calculated centrally and disseminated to all UAVs. Upon losing connection, a UAV approximates the intermediate Q-values by balancing its own primary Q-values with the last known QMIX values, based on the concept of the value decomposition method [58].
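The omitted formula is described as a balance between these two quantities; one schematic form, with a weighting coefficient β that is our assumption, is

```latex
\tilde{Q}_{\mathrm{int}}\big(s_i^{k}, a_i^{k}\big) \approx
\beta\, Q_i\big(s_i^{k}, a_i^{k}; \theta_i\big)
+ (1-\beta)\, Q_{\mathrm{int}}^{\mathrm{last}} .
```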
4.5.2. Centralized Q-Intermediate Approximation for Disconnected UAVs
For the central system on the UGV, when a UAV is disconnected, the UGV uses the last received values of the lost UAV to compute the central intermediate Q-value, since this update is less frequent. This lets the other UAVs anticipate the likely next states of the lost UAV and avoid overlapping with it. The proposed fault-tolerant method thereby enhances the robustness of UAV operations in the event of a lost connection. If the connection is not restored within a predefined time, the lost UAV returns to the UGV, and the computation continues without it.