1. Introduction
Aerial robots have become increasingly indispensable in a variety of applications. One such application is search and rescue (SAR) operations, especially in post-disaster scenarios where human intervention is often dangerous or impractical. These missions are typically conducted in environments with unpredictable conditions [1,2]. The foremost objective of SAR operations is the rapid location of targets and the execution of critical follow-up actions, such as relaying information and delivering essential supplies, all within a constrained timeframe. Employing aerial robots in SAR missions offers numerous benefits, including swift deployment, cost-effective maintenance, exceptional mobility, and the capacity to operate in areas where manual intervention is too risky or requires rapid decision-making [3]. Challenges include navigating among numerous obstacles, efficiently locating and assisting survivors, potential damage to ground system infrastructure, and limitations imposed by the aerial robot's battery capacity [4].
Path planning is a crucial component in these contexts, entailing the formulation of an optimal trajectory from the origin to the destination while adhering to operational constraints and mission objectives. Traditional path planning methods, such as grid-based and graph-based algorithms (e.g., A* [5], artificial potential fields [6], and Dijkstra's algorithm [7]), have demonstrated efficacy in stable scenarios [8]. However, they often struggle in the unpredictable and dynamic terrains characteristic of disaster zones, primarily due to limitations in real-time adaptability and autonomous decision-making in the face of complex obstacles and environmental uncertainties.
With advances in artificial intelligence (AI), particularly in machine learning (ML) and reinforcement learning (RL) [9], new solutions have emerged to address these challenges. RL, with its adaptability and learning-based approach [10], is particularly suitable for dynamic and uncertain post-disaster environments. RL systems have shown promise in dynamically adapting to changing terrains and unforeseen obstacles [11]. Deep reinforcement learning (DRL) [12], which incorporates deep neural networks into RL, has been explored for its effectiveness in complex and multifaceted scenarios [13,14]. Specifically, DRL's application to aerial robot path planning has led to remarkable improvements in multi-objective environments, such as navigating through disaster-stricken areas [15].
Aerial robots are constrained by limited flight duration, primarily due to their reliance on battery power. Consequently, strategic planning of flight routes and scheduling of recharge stops are essential to ensure successful mission completion. Energy-aware navigation frameworks for aerial robots are designed to address this challenge. They focus on efficient route planning that not only circumvents obstacles but also optimizes battery usage, thereby enhancing the operational range and effectiveness of aerial robots in various applications [16]. Bouhamed et al. [17] developed a framework utilizing the deep deterministic policy gradient (DDPG) algorithm for aerial robot navigation. This framework is designed to efficiently guide aerial robots to designated target positions while maintaining communication with ground stations. Moreover, Imanberdiyev et al. [18] proposed a methodology that monitors critical aerial robot parameters, including battery level, rotor condition, and sensor readings, for enhanced route planning; the approach dynamically adjusts the aerial robot's flight path to facilitate battery charging when needed. The authors of [19] developed an autonomous aerial robot path planning framework that uses DDPG to train aerial robots to navigate through or above obstacles to reach predetermined targets.
Despite these advancements, DRL models often struggle to manage multiple objectives simultaneously, such as reaching multiple targets while optimizing competing objectives. Managing several goals at once often leads to larger, more complex state spaces and to inefficiencies arising from the conflicting nature of the objectives. To overcome these challenges, distinct models for each objective are often needed [20,21]. Other frameworks, such as meta-learning-based DRL and modular hierarchical DRL architectures, have been suggested to address the computational demands and recalibration needs of complex, multi-objective scenarios [22].
Hierarchical reinforcement learning (HRL) can overcome these challenges of reinforcement learning [23]. Inspired by how humans solve complex problems, HRL not only breaks the problem down into sub-problems that are easier to handle but also trains multiple policies connected at different levels of temporal abstraction. HRL offers a structured approach for tasks involving multiple objectives by segmenting decision-making into different layers [24]. Its application in aerial robot navigation has included coordinating multi-objective missions, exemplified in recent studies where HRL has been employed to optimize task allocation and path planning [25,26].
Despite its advantages in managing complex tasks, HRL faces several critical problems, including sparse rewards, long time horizons, and the effective transfer of policies across different tasks. The most significant challenge, however, is ensuring that high-level policies assign feasible and well-defined tasks to low-level policies [27]. Inconsistent or poorly defined tasks can lead to inefficiencies and failures in task execution. Algorithms like hierarchical proximal policy optimization (HiPPO) aim to address this by jointly training all levels of the hierarchy and allowing continuous adaptation of skills even in new tasks [28]. However, HiPPO itself faces challenges, including the complexity of simultaneously training multiple levels of policies, increased computational demands, and difficulty in maintaining stability and convergence during training.
To address these challenges, this paper proposes a novel method for aerial robot path planning in SAR missions. Our work integrates adaptive long short-term memory (LSTM) networks for real-time battery consumption prediction within a hierarchical reinforcement learning framework. This integration offers several unique advantages:
- The LSTM model assists the high-level policy in selecting feasible goals for the low-level policy by predicting the battery requirements for each goal, ensuring that the chosen goals remain within the robot's energy constraints.
- Our model incorporates a bidirectional LSTM framework for accurate battery consumption prediction. This dual-layer structure processes data sequences in both forward and backward directions, enhancing the model's ability to understand context and sequence dynamics and thus providing more accurate predictions of the energy required for each target.
- By forecasting battery consumption for each target, the LSTM model informs the high-level controller (HLC) within the HRL framework, enabling more informed and energy-efficient goal selection. The framework dynamically adjusts flight paths based on real-time battery predictions, ensuring mission completion without mid-mission energy depletion and thereby increasing the overall success rate of SAR operations.
- The use of hindsight experience replay at the low-level controller (LLC) improves learning efficiency and robustness in changing environments. This accelerates the convergence of the learning algorithms, enabling quicker adaptation to dynamic environments and reducing the training time required for effective path planning.
- The proposed framework includes an adaptive mechanism that gradually reduces reliance on LSTM predictions as the HLC learns from environmental interactions. This transition enhances the HLC's autonomy and efficiency in mission planning, optimizing energy usage based on real-time learning and experience.
This study investigates energy-aware path planning for aerial robots tasked with delivering supplies to multiple targets while avoiding obstacles. Path planning is conducted over two hierarchical levels. An LSTM network first predicts the battery consumption for each target, informing the high-level HRL policy's goal selection. The selected goal then guides the low-level navigational decisions, enabling the aerial robot to identify efficient paths while avoiding obstacles. Additionally, the LSTM's predictions assist the RL algorithm in making energy-efficient decisions by forecasting battery consumption for upcoming states. To address non-stationarity at the low level, we employ hindsight experience replay in the low-level policy. Within the HRL framework, RL algorithms are deployed to implement both the HLC and the LLC. Among candidate algorithms, the soft actor-critic (SAC) [29,30] is known for its adaptability and efficiency, making it well suited to aerial robot navigation in high-dimensional continuous action spaces. SAC incorporates entropy regularization, which inherently promotes exploration; this is crucial because it encourages thorough exploration of the state space and thereby facilitates the development of versatile navigation strategies. The off-policy nature of SAC also enhances learning efficiency by leveraging past experiences, a critical advantage when real-time responsiveness is paramount. We therefore incorporate SAC into our HRL framework. We evaluate the algorithm on an aerial robot tasked with delivering food to multiple survivors, focusing on energy efficiency and time optimization along paths that require obstacle avoidance.
The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 presents the preliminaries and problem formulation, and Section 4 introduces the proposed HRL method. Section 5 presents the experiments and results, and Section 6 offers final remarks and future work.
3. Problem Definition
This research addresses the multifaceted challenge of optimal path planning for aerial robots in the critical context of post-disaster missions. The objective is to deploy an aerial robot to deliver vital supplies to survivors whose locations are predetermined yet situated in an uncertain post-disaster environment. The aerial robot is equipped with LIDAR for obstacle detection, GPS for positioning, and IMU sensors to track velocity.
The goal is to design a path for the aerial robot that satisfies time efficiency, energy consumption, and collision avoidance constraints. Furthermore, each successful delivery results in a decrease in the aerial robot’s payload weight, a factor that introduces variables affecting the aerial robot’s flight dynamics and energy efficiency. The problem formulation is presented as follows:
Given a set of targets $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$, the aerial robot must determine a path $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ denotes the trajectory to target $T_i$, such that the total operational cost $C(P)$ is minimized. The cost encompasses the combined metrics of time $T(p_i)$, energy $E(p_i)$, and collision avoidance $S(p_i)$, represented as a weighted sum in the objective function:

$$\min_{P} \; C(P) = \sum_{i=1}^{n} \big( w_1\, T(p_i) + w_2\, E(p_i) + w_3\, S(p_i) \big),$$
subject to the following constraints:
- Collision avoidance: each trajectory segment $p_i$ must adhere to safe navigational practices, as determined by onboard LIDAR and GPS data, to avoid collisions.
- Energy consumption: the energy function $E(p_i)$ accounts for the variable weight due to payload changes, as well as the aerial robot-specific power consumption profile.
- Time efficiency: the time function $T(p_i)$ measures the temporal efficiency of each trajectory, with lower values indicating more desirable paths.
- Safety: the safety function $S(p_i)$ measures the aerial robot's adherence to collision avoidance throughout the mission, with higher values indicating safer operation.
- Battery constraint: the trajectory must be completed without recharging, which imposes a natural limit on the length and complexity of the path.
The optimization problem is non-convex due to the nonlinear nature of the energy and safety functions and the discrete nature of the collision avoidance constraints.
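For illustration, the following minimal sketch shows how such a weighted cost could be evaluated for a candidate set of trajectory segments; the weights and per-segment metrics are hypothetical placeholders rather than values from this work.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Metrics for one trajectory segment p_i (hypothetical units)."""
    time: float      # T(p_i): flight time
    energy: float    # E(p_i): battery energy spent
    safety: float    # S(p_i): collision-avoidance penalty term

def total_cost(segments, w_time=0.4, w_energy=0.5, w_safety=0.1):
    """Weighted-sum objective C(P) over all trajectory segments."""
    return sum(w_time * s.time + w_energy * s.energy + w_safety * s.safety
               for s in segments)

# Example with three assumed segments (one per target).
path = [Segment(42.0, 130.0, 0.1), Segment(55.0, 170.0, 0.3), Segment(38.0, 110.0, 0.0)]
print(total_cost(path))
```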
To achieve this, drawing on insights from a thorough literature survey and recognizing the inherent advantages of hierarchical approaches, we propose to address the problem by leveraging goal-conditioned HRL. One of the main challenges in HRL is the high level assigning goals to the low level, primarily because the HLC may assign unattainable goals [53] that are implausible or beyond the capabilities of the LLC. When the HLC requests an unattainable goal, the LLC tends to persist in its efforts until a timeout is reached, incurring a notable cost in terms of interactions with the environment. This issue is particularly noticeable during the initial stages of training, as the HLC may frequently select unattainable goals without accounting for the associated resource costs [54]. In our mission, a pivotal source of infeasible goals is battery-level feasibility: whether a goal can be attained given the aerial robot's limited battery life. This challenge is intricately linked to the HLC's capacity to set realistic and achievable goals. While the aerial robot's targets are predetermined and accessible, ensuring that the aerial robot can reach them without exhausting its battery reserves is critical. This necessitates the integration of a battery consumption model within the HLC, capable of accurately estimating the power required to reach each delivery target and return to the recharge station.
To overcome these challenges, we propose an approach using goal-conditioned HRL integrated with an external pretrained LSTM-based battery consumption predictor for energy-aware decision-making. Because the LSTM can predict the aerial robot's battery consumption in subsequent states from its current state, it helps the LLC choose more energy-efficient next actions.
Additionally, goal-conditioned HRL systems face the challenge of non-stationarity at the low level. This challenge arises because the high-level controller adjusts goals, which can shift the landscape that the low-level controller must navigate and affect its ability to maintain consistent policy performance [55]. To mitigate this, the framework utilizes hindsight experience replay (HER), which addresses the low-level policy's non-stationarity arising from goal changes dictated by the HLC. This adaptation also assists the HLC in refining its decision-making based on previous failures to achieve feasible goals, thereby enhancing its ability to set more attainable goals for the low level. The schematic of the proposed method is illustrated in Figure 1.
3.1. Goal-Conditioned Markov Decision-Making
In formulating the RL problem, we formalize it using Markov decision processes (MDPs) [56]. Adhering to the constructs of MDPs allows the RL problem to be articulated with precision and clarity, facilitating its systematic analysis and resolution within the domain of decision-making under uncertainty. In HRL settings, the MDP [13] is typically augmented with a set of goals $\mathcal{G}$. This augmentation is pivotal to aligning the MDP with goal-oriented tasks.
The goal set turns the standard MDP into a goal-conditioned MDP, a framework that incorporates objectives directly into the decision-making process: the agent's actions respond both to the environment and to specific goals. In this formulation, we consider a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{G}, P, r, \gamma)$, where $\mathcal{S}$ represents the state space; $\mathcal{A}$ signifies the action space; $\mathcal{G}$ encompasses the set of goals that guide the agent's policy; $P$ is the transition probability function, dictating the state evolution; $r$ is the goal-dependent reward function; and $\gamma$ is the discount factor, indicating the value of future rewards.
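As a concrete illustration of the tuple above, the following is a minimal sketch of a goal-conditioned environment wrapper; the class and method names are illustrative assumptions, not part of our implementation.

```python
from typing import Callable, Tuple
import numpy as np

class GoalConditionedEnv:
    """Wraps an ordinary MDP so that reward depends on (state, action, goal)."""

    def __init__(self, env, reward_fn: Callable[[np.ndarray, np.ndarray, np.ndarray], float]):
        self.env = env              # underlying MDP exposing reset()/step()
        self.reward_fn = reward_fn  # goal-dependent reward r(s, a, g)
        self.goal = None

    def reset(self, goal: np.ndarray) -> np.ndarray:
        self.goal = goal            # fix the goal until the high level changes it
        return self.env.reset()

    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool]:
        next_state, _, done, *_ = self.env.step(action)
        reward = self.reward_fn(next_state, action, self.goal)  # goal-conditioned reward
        return next_state, reward, done
```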
3.2. Soft Actor-Critic Algorithm
In this section, we introduce the foundational principles and mathematical underpinnings of the SAC algorithm, a cornerstone of our proposed HRL framework. SAC, an advanced off-policy algorithm in the domain of deep reinforcement learning, excels in managing complex and high-dimensional control tasks. It is particularly lauded for its ability to strike a harmonious balance between exploration and exploitation, achieved through the incorporation of entropy into the reward optimization process. This attribute is crucial for navigating continuous action spaces with a high degree of efficiency and sample economy. Moreover, adding an entropy term to the objective function helps ensure that policies do not become overly deterministic too quickly. This encourages exploration and helps avoid premature convergence to suboptimal policies, enhancing stability. The entropy term promotes sufficient exploration of the policy space, which is critical for navigating the complex and non-convex optimization landscape effectively [57,58]. The SAC objective function for the policy $\pi$ is given by:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],$$

where $\rho_\pi$ denotes the state-action distribution under policy $\pi$, $r(s_t, a_t)$ is the reward function, $\alpha$ stands for the temperature parameter controlling the trade-off between reward and entropy, and $\mathcal{H}$ signifies the entropy of the policy.
The core components of SAC, instrumental in operationalizing its objectives, are the action-value function $Q(s_t, a_t)$, the state-value function $V(s_t)$, and the policy $\pi(a_t \mid s_t)$, which are defined as follows:

$$V(s_t) = \mathbb{E}_{a_t \sim \pi} \big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big],$$
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P} \big[ V(s_{t+1}) \big].$$
The policy is updated by minimizing the expected KL divergence to a target policy, effectively adjusting the policy toward actions that maximize the expected sum of rewards and entropy. The optimization of these components is achieved through iterative updates, leveraging stochastic gradient descent and experience replay for efficient learning.
SAC's introduction of the entropy term into the optimization objective ensures a more exploratory strategy than traditional reinforcement learning algorithms. By dynamically adjusting the policy to encourage exploration, SAC mitigates the risk of premature convergence to suboptimal policies and enhances the algorithm's adaptability to diverse and unpredictable environments.
In operationalizing SAC, neural networks with parameters $\theta$ and $\phi$ are employed to represent the action-value function $Q_\theta$ and the policy $\pi_\phi$, respectively. The iterative refinement of these entities is facilitated through minibatch sampling from the experience replay buffer $\mathcal{D}$. Furthermore, the introduction of a target network for the value estimate enables soft updates, markedly enhancing the stability of the learning process. The optimization of SAC's framework is governed by two distinct loss functions, tailored to refine the critic and actor networks. The critic and actor losses are given by:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( Q_\theta(s_t, a_t) - \hat{y}_t \big)^2 \Big],$$
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi} \big[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \big],$$

in which the bootstrapped target is $\hat{y}_t = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}} \big[ V_{\bar{\theta}}(s_{t+1}) \big]$, with $V_{\bar{\theta}}$ computed from the target network.
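The two losses above can be sketched in PyTorch as follows; the `actor.sample` interface, the replay batch layout, and the single-critic setup are simplifying assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critic, critic_target, alpha=0.2, gamma=0.99):
    """Compute the SAC critic and actor losses for one replay minibatch."""
    s, a, r, s2, done = (batch[k] for k in ("state", "action", "reward", "next_state", "done"))

    # Critic loss: regress Q(s, a) toward the soft Bellman target.
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)                            # a' ~ pi(.|s'), with log-prob
        soft_value = critic_target(s2, a2) - alpha * logp2      # soft value of the next state
        y = r + gamma * (1.0 - done) * soft_value               # bootstrapped target
    critic_loss = F.mse_loss(critic(s, a), y)

    # Actor loss: maximize Q - alpha * log(pi), i.e., minimize its negation.
    a_new, logp = actor.sample(s)
    actor_loss = (alpha * logp - critic(s, a_new)).mean()
    return critic_loss, actor_loss
```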
4. Methodology
In this section, we outline the methodology. We first describe the LSTM battery prediction model and then present our HRL framework and its integration into the system.
4.1. LSTM Battery Prediction
Predicting battery consumption accurately is critical for optimizing the aerial robot's path planning and mission execution. LSTM networks, known for their ability to capture long-term dependencies and temporal patterns in sequential data [59], are particularly effective for this task. Recent advancements in LSTM applications, such as those demonstrated by [60,61], highlight their effectiveness in energy consumption prediction.
Our proposed model for battery consumption prediction incorporates a bidirectional LSTM (Bi-LSTM) framework, as illustrated in Figure 2. A Bi-LSTM layer is a neural network architecture that extends traditional LSTM networks by introducing a dual structure that processes data sequences in both the forward and backward directions simultaneously [62]. This bidirectional approach allows the network to capture temporal dependencies from both past (backward) and future (forward) states, thereby enhancing its ability to understand context and sequence dynamics.
Each Bi-LSTM layer comprises two separate LSTMs that process the data in opposite directions, one forward and one backward, with their respective parameters and hidden states; their outputs are concatenated at each time step, yielding a more robust representation of temporal sequences. Bi-LSTM layers are particularly effective in tasks where the context of the entire sequence is crucial for accurate predictions [63]. The incorporation of two Bi-LSTM layers in our architecture enhances the model's capacity to capture and interpret complex temporal dependencies in sequential data. The first Bi-LSTM layer captures immediate, short-term temporal patterns, while the second layer, receiving the output of the first, discerns more abstract, higher-level temporal relationships. This makes the model particularly suited for complex sequential modeling tasks where understanding both past and future contexts is vital [64].
The input layer includes aerial robot-specific operational parameters: velocity, payload weight, historical battery measurements, current battery level, and the position of the aerial robot. These parameters heavily influence battery consumption in aerial robot operations. Note that our analysis does not consider flight and environmental conditions such as motor efficiency and wind, and the altitude is assumed constant. Additionally, the model's scope is limited to the cruising phase of aerial robot operation, excluding takeoff and landing maneuvers.
Following the Bi-LSTM layers, a dropout layer counteracts overfitting. The dropout rate serves as a hyperparameter, adjusted during the model's validation phase for optimal performance. After the dropout layer, the data are processed by a dense layer, which further interprets the features extracted by the bidirectional LSTM layers and consolidates the information into a form suitable for output prediction. This dense layer uses a ReLU activation function and has 128 cells.
The output layer uses a linear activation function to predict the upcoming battery level and determines the battery required to reach each target. Each LSTM layer contains 128 hidden cells. The input data are processed in time windows of length 20, matched to the HLC goal-update interval, which ensures that the model's predictions are synchronized with the HLC updates.
Regarding the mathematical formulation of the proposed LSTM model, we define an input vector $x_t$ for each time instant $t$, which includes the key inputs listed above. The aerial robot's energy consumption at time $t$ is represented as $y_t$. Our goal is to develop a predictive function $f$ that accurately predicts energy consumption across a specified time window. The core of our analysis involves minimizing the total squared difference between the actual ($y_t$) and predicted ($\hat{y}_t$) energy consumption. This objective, aimed at enhancing the accuracy of energy usage predictions for aerial robot operations, is formalized as

$$\min_{f} \; \sum_{t} \big( y_t - \hat{y}_t \big)^2, \qquad \hat{y}_t = f(x_t),$$

ensuring that $f$ accurately predicts $y_t$ from $x_t$.
For this problem, the mean squared error (MSE) is employed as the loss function and Adam as the optimizer. The hyperparameters are shown in Table 1; they were chosen based on the minimum average MSE, as these values yielded the best results during 5-fold cross-validation.
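A minimal PyTorch sketch of the described architecture (two stacked Bi-LSTM layers with 128 hidden cells, dropout, a 128-cell ReLU dense layer, and a linear output) is given below; the input feature count is a placeholder, while the dropout rate and learning rate follow the hyperparameters selected in Section 5.2.

```python
import torch
import torch.nn as nn

class BatteryBiLSTM(nn.Module):
    """Two stacked Bi-LSTM layers -> dropout -> dense(ReLU, 128) -> linear battery prediction."""

    def __init__(self, n_features: int, hidden: int = 128, dropout: float = 0.5):
        super().__init__()
        # num_layers=2 stacks two Bi-LSTM layers; bidirectional=True doubles the output width.
        self.bilstm = nn.LSTM(n_features, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.dense = nn.Linear(2 * hidden, 128)
        self.out = nn.Linear(128, 1)              # linear output: predicted battery level

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window=20, n_features); use the Bi-LSTM output at the last time step.
        seq_out, _ = self.bilstm(x)
        h = self.dropout(seq_out[:, -1, :])
        h = torch.relu(self.dense(h))
        return self.out(h).squeeze(-1)

# Training uses MSE loss with the Adam optimizer, as in the paper.
model = BatteryBiLSTM(n_features=8)               # n_features is a placeholder
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```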
4.2. Proposed Hierarchical Reinforcement Learning Framework
4.2.1. Goal-Conditioned Hierarchical Reinforcement Learning
This framework employs a two-level policy structure: the HLC, represented as $\pi_h(g \mid s)$, determines goals based on the environmental state, and the LLC, represented as $\pi_l(a \mid s, g)$, selects actions to achieve these goals. The HLC is responsible for setting feasible goals for the LLC. One important difference between the HLC and LLC learning processes is that the LLC generates transitions at every time step through primitive aerial robot actions, while the HLC's transitions are produced on a slower time scale through a sequence of goal selections.
The HLC is tasked with the strategic selection of intermediate objectives, referred to as feasible goals $g_t$, based on the agent's current environmental state $s_t$. These goals are set within the goal space $\mathcal{G}$, which, in our study, includes the locations of the multiple targets.
Transition dynamics in the HRL framework are critical: they define how the agent moves from one goal to the next. Mathematically, this transition is represented as:

$$g_t \sim \pi_h(\cdot \mid s_t), \quad t = 0, c, 2c, \ldots \tag{6}$$

Equation (6) denotes the transition mechanism from one high-level goal to the next, occurring at intervals of $c$ time steps. Here, $c$ represents the frequency of updates from the HLC, indicating how often the high-level goals are reassessed and potentially altered based on the evolving state of the environment.
Between these transition points, the high-level goal adaptation function $h$ comes into play, defined as:

$$g_{t+1} = h(s_t, g_t), \qquad h : \mathcal{S} \times \mathcal{G} \rightarrow \mathcal{G}, \tag{7}$$

where $h$ represents the differential mapping from the state-goal pairs to the goal space $\mathcal{G}$.
The LLC undertakes the execution of these high-level objectives by generating a sequence of actions in response to the current state $s_t$ and the high-level goal $g_t$. The efficacy of these actions is evaluated based on feedback from the environment, encapsulated in the reward signal. The LLC performs immediate goal-conditioned actions based on the current state and goal, with actions optimized at every time step. The low-level policy aims to achieve the set goals within the $c$-step timeframe, guided by an intrinsic reward function. This process continues until the target is reached or one of three termination conditions occurs: the aerial robot collides with an obstacle, it fails to reach the target within a predefined maximum number of time steps, or its remaining battery drops to the charge required for the goal.
The HLC and the LLC receive rewards separately while interacting with the environment. The HLC receives an extrinsic reward from the environment for choosing the best goal for the LLC given its state, while the LLC receives intrinsic rewards conditioned on the goal specified by the HLC. The HLC optimizes its policy $\pi_h$ to select goals $g_t$ based on the state $s_t$, aiming to maximize a combination of expected cumulative reward and policy entropy. The reward for the HLC incorporates factors such as mission success and the feasibility of goals given the predicted battery consumption. The reward function for the HLC is formulated to accumulate rewards over a fixed interval of $c$ time steps:

$$R^{h}_t = \sum_{i=0}^{c-1} \gamma^{i}\, r(s_{t+i}, a_{t+i}). \tag{8}$$

Equation (8) describes the reward function for the HLC. It accumulates the rewards over $c$ time steps, with $\gamma$ representing the discount factor, a measure of the importance of future rewards.
The primary goal of the high-level policy is to optimize the expected cumulative reward, which is expressed as:

$$J(\pi_h) = \mathbb{E}_{\pi_h} \Big[ \sum_{k} \gamma^{k}\, R^{h}_{kc} \Big].$$

In contrast, the low-level policy aims to achieve the set goals within the $c$-step timeframe, guided by an intrinsic reward function. This function is designed to maximize the expected return related to the achievement of the intermediate goals:

$$J(\pi_l) = \mathbb{E}_{\pi_l} \Big[ \sum_{t} \gamma^{t}\, r^{l}(s_t, a_t, g_t) \Big].$$
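The interaction between the two levels can be summarized by the rollout sketch below: the HLC re-selects a goal every c steps, while the LLC acts at every step toward the current goal. All function and method names are illustrative assumptions.

```python
def hierarchical_rollout(env, hlc, llc, lstm_predictor, c=20, max_steps=500):
    """One episode of goal-conditioned HRL: the HLC picks a goal every c steps."""
    state = env.reset()
    goal, goal_state, hlc_return = None, None, 0.0
    for t in range(max_steps):
        if t % c == 0:                                       # HLC decision point
            battery_preds = lstm_predictor(state)            # predicted battery need per target
            goal, goal_state, hlc_return = hlc.select_goal(state, battery_preds), state, 0.0
        action = llc.select_action(state, goal)              # goal-conditioned primitive action
        next_state, reward, done = env.step(action)
        llc.store(state, goal, action, reward, next_state)   # per-step LLC transition
        hlc_return += reward                                 # extrinsic reward accumulated over c steps
        if (t + 1) % c == 0 or done:
            hlc.store(goal_state, goal, hlc_return, next_state)  # slower-time-scale HLC transition
        state = next_state
        if done:
            break
```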
4.2.2. Soft Actor-Critic for Proposed Hierarchical Reinforcement Learning Augmented with LSTM Battery Prediction
In this framework, we employ the off-policy RL algorithm SAC to train policies at both the HLC and LLC. At the HLC layer, SAC optimizes the policy for selecting targets that maximize mission success and operational efficiency. At the LLC layer, SAC optimizes the aerial robot's navigation toward the chosen targets and the efficient management of specific tasks.
The HLC is responsible for making decisions on selecting reachable targets, considering states and the predicted battery for each target from the LSTM. The LSTM model predicts battery consumption for different targets to help the HLC policy choose feasible goals for the LLC. The LSTM’s predictions are integrated into the HLC framework as part of the state information, aiding the HRL model in making more informed decisions, especially during the early stages of training when it has not yet learned to effectively predict battery usage.
The critic loss function for the HLC using SAC is defined as:

$$J_Q^{h}(\theta) = \mathbb{E}_{(s_t, g_t, R^{h}_t, s_{t+c}) \sim \mathcal{D}_h} \Big[ \tfrac{1}{2} \big( Q^{h}_\theta(s_t, g_t) - \hat{y}^{h}_t \big)^2 \Big],$$

where $\hat{y}^{h}_t = R^{h}_t + \gamma\, V^{h}_{\bar{\theta}}(s_{t+c})$ and $\mathcal{D}_h$ is the replay buffer for the HLC, containing transitions collected over a slower time scale through sequences of goal selections. Moreover, the actor loss function for the HLC is given by:

$$J_\pi^{h}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}_h,\, g_t \sim \pi^{h}_\phi} \big[ \alpha \log \pi^{h}_\phi(g_t \mid s_t) - Q^{h}_\theta(s_t, g_t) \big].$$
The LLC is tasked with implementing objectives delineated by the HLC, operating autonomously to fulfill these specified goals. It is presumed that the goals set by the HLC are optimal, guiding the LLC’s policy to adapt its actions toward efficient goal achievement. The LLC’s primary responsibility includes navigating the aerial robot toward its objectives while managing its battery efficiently and navigating around obstacles, embodying a direct application of the high-level strategy to operational actions. The HLC objectives for the LLC persist until the LLC either fulfills these objectives or encounters conditions that necessitate an alternative approach, as mentioned before.
We enhance the decision-making process of the LLC by integrating LSTM-based predictions of battery consumption into the actor network's objective function. This integration allows the LLC to make more informed decisions that not only aim to achieve operational objectives but also optimize for energy efficiency. The key idea is to leverage the LSTM's capability to predict the battery consumption associated with different actions in given states, enabling the aerial robot to prefer actions that are energy efficient. The LLC objective function therefore integrates the LSTM's prediction of battery consumption, adjusting the SAC objective to balance mission success, policy entropy, and predicted battery efficiency. Thus, the LLC actor loss function is defined as:

$$J_\pi^{l}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}_l,\, a_t \sim \pi^{l}_\phi} \big[ \alpha \log \pi^{l}_\phi(a_t \mid s_t, g_t) - Q^{l}_\theta(s_t, a_t, g_t) + \lambda\, \hat{B}(s_t, a_t) \big],$$

where $\hat{B}(s_t, a_t)$ represents the battery consumption cost of taking action $a_t$ in state $s_t$, as predicted by the LSTM, and $\lambda$ is a weighting factor that balances the importance of energy efficiency (battery usage optimization) against the other objectives within the LLC's loss function. This formulation explicitly integrates LSTM-based predictions into the LLC's objective function, allowing the aerial robot to make informed decisions that optimize both operational objectives and energy efficiency.
The critic loss for the LLC is defined as follows:

$$J_Q^{l}(\theta) = \mathbb{E}_{(s_t, g_t, a_t, r^{l}_t, s_{t+1}) \sim \mathcal{D}_l} \Big[ \tfrac{1}{2} \big( Q^{l}_\theta(s_t, a_t, g_t) - \big( r^{l}_t + \gamma\, V^{l}_{\bar{\theta}}(s_{t+1}, g_t) \big) \big)^2 \Big].$$
The architecture and hyperparameters of the proposed hierarchical reinforcement learning model are detailed in
Table 2.
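The following sketch illustrates how the LSTM battery-cost term enters the LLC's SAC actor loss as formulated above; the `lstm_cost` interface and the value of the weighting factor λ (`lam`) are assumptions for illustration.

```python
def llc_actor_loss(states, goals, actor, critic, lstm_cost, alpha=0.2, lam=0.1):
    """SAC actor loss augmented with the LSTM-predicted battery-consumption cost."""
    actions, log_pi = actor.sample(states, goals)     # a ~ pi(.|s, g), with log-probability
    q_values = critic(states, actions, goals)         # Q(s, a, g)
    battery_cost = lstm_cost(states, actions)         # predicted energy cost B(s, a) from the LSTM
    # Minimize alpha*log(pi) - Q + lambda*B: entropy-regularized and energy-penalized.
    return (alpha * log_pi - q_values + lam * battery_cost).mean()
```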
4.2.3. Switching Mechanism in the HLC
In our proposed framework, the HLC initially leverages LSTM predictions of battery consumption to inform its goal-setting and decision-making processes. This reliance on LSTM outputs facilitates early-stage learning, enabling the HLC to make informed decisions with limited experience. As the HLC interacts more with the environment, it starts learning from its own experience, gradually understanding the dynamics of battery consumption in relation to various actions and environmental conditions. As this knowledge grows, the reliance on the LSTM estimator decreases.
Therefore, as training progresses, our framework incorporates an adaptive switching mechanism that dynamically adjusts the reliance on LSTM predictions based on their accuracy compared with the model’s internal decision-making processes. The HLC gradually transitions to a more autonomous decision-making model when its internally generated predictions or decisions consistently exhibit lower errors compared with the LSTM’s predictions. This transition underscores the HLC’s capability to internalize and surpass the LSTM model’s predictive accuracy through continuous learning, ultimately achieving enhanced autonomy and efficiency in mission planning and execution.
This shift is governed by comparing the temporal difference (TD) error, $\delta_t$, against the LSTM prediction error $e_{\mathrm{LSTM}} = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{b}_i - b_i \big)^2$, where $N$ is the number of instances or episodes considered for the evaluation, $\hat{b}_i$ is the LSTM model's predicted battery level for the $i$th instance, and $b_i$ is the corresponding observed battery level. The policy switch occurs when the TD error falls below $e_{\mathrm{LSTM}}$ and a minimum number of training episodes have passed.
The TD error is defined as $\delta_t = r_t + \gamma\, V(s_{t+1}) - Q(s_t, a_t)$. Here, $r_t$ is the reward at time $t$; $V(s_{t+1})$ is the value function of the next state, estimated by the critic network, representing the expected return; $Q(s_t, a_t)$ is the action-value function output by the critic network for the current state-action pair, indicating the expected return of taking action $a_t$ in state $s_t$; and $\gamma$ is the discount factor, weighing the importance of future rewards. This adaptive mechanism ensures that the HLC enhances its autonomy and efficiency in mission planning, optimizing energy usage based on real-time learning and experience.
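A sketch of the adaptive switching test under these definitions is given below; the error window and the minimum-episode threshold are illustrative assumptions.

```python
import numpy as np

def should_switch(td_errors, lstm_predictions, observed_batteries,
                  episode, min_episodes=1000):
    """Return True once the HLC's own value estimates beat the LSTM battery predictor."""
    td_error = float(np.mean(np.abs(td_errors)))                         # mean |delta_t|
    lstm_error = float(np.mean((np.asarray(lstm_predictions)
                                - np.asarray(observed_batteries)) ** 2)) # LSTM prediction MSE
    return episode >= min_episodes and td_error < lstm_error
```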
4.3. Proposed HRL Modeling
4.3.1. HLC Network
The state space of the HLC is designed to capture the navigational and operational metrics essential for optimizing the aerial robot's mission strategy. It comprises: the Euclidean distances from the aerial robot's current coordinates to each mission target and to the recharge station, which enhance route planning efficacy; the bearings from the aerial robot to these points of interest, crucial for directional guidance; the current battery level and the payload weight, which directly influence the aerial robot's flight dynamics and energy consumption; the aerial robot's velocity vector; and the predicted battery requirement for reaching each target, as estimated by the LSTM model. Together, these elements facilitate informed, energy-efficient decision-making.
The action space of the HLC consists of two components: the selection of the next target point and the anticipated battery charge required to reach the selected target. This action space enables the HLC to dynamically choose between progressing toward mission targets or recharging, contingent upon the current operational state and energy requirements. In addition, to enhance the reliability of our proposed framework, we incorporate a safety margin into the battery consumption prediction model. This safety margin accounts for uncertainties, such as unexpected maneuvers to avoid obstacles, which might affect the aerial robot's energy consumption. After the initial 1000 episodes of training, this safety margin is manually adjusted based on observed performance and environmental interactions.
The reward function for the HLC is designed to encapsulate this layer's objectives, promoting actions that enhance mission success while conserving energy and ensuring safety. It comprises the following components:
Target Achievement Reward: the primary component of the HLC's reward incentivizes the selection of reachable targets and efficient resource utilization. It combines a base reward for targeting a new goal with penalty terms proportional to the Euclidean distance to the selected target, the predicted battery usage to reach it, and the distance already traveled by the aerial robot, each weighted by a scaling factor that adjusts its influence on the reward; the traveled-distance term helps the HLC choose the optimal path. When a target is deemed unreachable due to goal infeasibility, a high penalty is applied to discourage its selection.
Successful Delivery Reward: upon successful delivery of a payload to a target, the HLC is awarded a significant fixed reward for each target.
Mission Efficiency Bonus: to further incentivize completing the mission with minimal energy expenditure, an additional bonus is awarded if all targets are served without the need for recharging.
Recharging Decision Component: the HLC's decision to recharge the aerial robot's battery is also factored into the reward function.
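To make these components concrete, the following sketch combines them into a single reward computation; all coefficients are hypothetical placeholders rather than the tuned values used in our experiments.

```python
def hlc_reward(reachable, distance, predicted_battery, traveled,
               delivered, all_served_without_recharge,
               r_base=10.0, w_dist=0.01, w_batt=0.05, w_trav=0.01,
               r_deliver=50.0, r_bonus=100.0, p_unreachable=-100.0):
    """Weighted combination of the HLC reward components (illustrative coefficients)."""
    if not reachable:
        return p_unreachable                                  # infeasible-goal penalty
    reward = r_base - w_dist * distance - w_batt * predicted_battery - w_trav * traveled
    if delivered:
        reward += r_deliver                                   # successful delivery reward
    if all_served_without_recharge:
        reward += r_bonus                                     # mission efficiency bonus
    return reward
```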
4.3.2. LLC Network
The state space of the LLC encapsulates the aerial robot's operational parameters necessary for executing navigational tasks toward predefined goals. It comprises: the aerial robot's distance to the target, computed from GPS coordinates; the orientation toward the goal; the velocity components from the IMU; the LIDAR readings for obstacle detection; the battery predicted by the HLC as required to reach the goal; and the current battery level.
The LLC leverages environmental data acquired through LIDAR distance sensors, characterized by a scan angle range of π radians. The horizontal plane is monitored using seven sensors, with an angular resolution of π/6 radians between adjacent rays and a maximum range of 50 m. This configuration ensures detailed spatial awareness, facilitating the aerial robot's ability to navigate complex environments by detecting and avoiding obstacles.
The LLC's action space is defined to provide precise control over the aerial robot's navigational and energy management strategies. It consists of speed adjustments and yaw angle modifications, each normalized to the range −1 to 1.
The LLC is responsible for executing navigation and operational tasks. Its intrinsic reward function is designed to encourage efficient path execution and safe navigation, and comprises the following components:
Proximity to Target Reward: this reward increases as the aerial robot moves closer to the target, encouraging a reduction in the distance to the goal. It is defined as the difference between the distance to the target from the current state and the distance to the target from the next state.
Target Reach Reward: a large positive reward is given when the aerial robot reaches the target.
Efficient Path Penalty: a small constant penalty is applied for each time step taken, encouraging time efficiency in reaching the target.
Energy Efficiency Reward: this reward encourages energy-efficient navigation. It is computed from the battery level specified for the target by the high level, the LSTM-predicted battery level for the next state, the current battery level, and the total battery capacity.
Battery Threshold Penalty: a penalty is applied if the battery level falls below a certain threshold.
Obstacle Avoidance Penalties: a safety-zone penalty is applied when the aerial robot enters the safety zone around an obstacle, encouraging it to maintain a safe distance. Moreover, a substantially larger penalty is applied for colliding with obstacles, reflecting the higher severity of collisions.
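A corresponding sketch of the LLC intrinsic reward is given below; the thresholds and penalty magnitudes are hypothetical placeholders.

```python
def llc_reward(d_curr, d_next, reached, battery, battery_threshold,
               in_safety_zone, collided,
               step_penalty=0.05, r_reach=100.0, p_battery=-50.0,
               p_safety=-5.0, p_collision=-100.0):
    """Proximity shaping plus sparse bonuses and penalties for the low-level controller."""
    reward = (d_curr - d_next) - step_penalty   # progress toward the target, minus a per-step cost
    if reached:
        reward += r_reach                        # target reach reward
    if battery < battery_threshold:
        reward += p_battery                      # battery threshold penalty
    if in_safety_zone:
        reward += p_safety                       # entered the obstacle safety zone
    if collided:
        reward += p_collision                    # collision penalty (|p_collision| > |p_safety|)
    return reward
```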
4.4. Hindsight Experience Replay
One of the challenges in deploying HRL for aerial robot navigation is the non-stationary nature of the environment, which stems from dynamic changes in the goals set by the HLC. These changes can cause the optimal policy to shift over time, making it difficult for the LLC to consistently achieve its assigned goals.
To address this challenge, we incorporate HER into our framework. HER is particularly adept at mitigating the effects of non-stationarity by allowing the LLC to learn from both successful and unsuccessful attempts, effectively turning failures into valuable learning opportunities. HER enables the LLC to reinterpret previously unsuccessful attempts at reaching a goal as successful outcomes toward alternative goals, fostering adaptive learning from every mission scenario. This approach enhances the LLC's capability to adjust its strategy in response to changing environmental conditions and mission parameters, ensuring continuous improvement in operational efficiency and decision-making.
To implement it, we begin with the LLC and ensure that the replay buffer is initially empty. At the onset of each episode, an initial state $s_0$ and a goal $g$ are randomly selected from their respective spaces, $\mathcal{S}$ and $\mathcal{G}$. During the episode, spanning environment steps $t = 0, \ldots, T-1$, the LLC engages with the environment to produce transitions $(s_t, a_t, r_t, s_{t+1})$.
Upon completion of these steps, HER examines the sequence of states traversed, $s_0, s_1, \ldots, s_T$. Utilizing this sequence, HER populates the replay buffer with each transition $(s_t, a_t, r_t, s_{t+1})$ alongside the initial goal $g$. In a subsequent step, HER enriches the dataset by appending additional transitions $(s_t, a_t, r'_t, s_{t+1})$, each associated with a new goal $g'$, where the new goals are selected uniformly from the encountered states $s_0, \ldots, s_T$.
This enhancement process allows HER to offer the LLC agent augmented rewards $r'_t$ for each new goal $g'$, irrespective of whether the original goal $g$ was achieved. Therefore, HER boosts the LLC agent's efficiency in learning and its capability to successfully attain goals, broadening the agent's exposure to a variety of potential scenarios.
When the aerial robot fails to reach its designated goal due to battery exhaustion, HER recalibrates the learning objective based on the actual operational outcome: it identifies the furthest point reached within the battery's capacity as a new achievable goal $g'$. This relabeling process includes updating the experience tuple to reflect the current battery level and correlating it with the traveled distance. In instances where the aerial robot's mission is compromised by an inability to navigate around obstacles, HER adapts by selecting the point of failure as the new goal $g'$. For scenarios where the aerial robot does not fulfill its objective within the allocated number of steps, HER marks the endpoint reached within the step constraint as the revised goal $g'$.
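The goal-relabeling step described above can be sketched as follows; the replay-buffer and reward-function interfaces are assumptions, and the uniform sampling stands in for the failure-specific relabeling rules (battery exhaustion, obstacle failure, step limit) described in the text.

```python
import random

def her_relabel(episode, goal, replay_buffer, reward_fn, k=4):
    """episode: list of (state, action, reward, next_state) transitions collected under `goal`."""
    visited = [tr[0] for tr in episode] + [episode[-1][3]]   # s_0, ..., s_T
    for (s, a, r, s_next) in episode:
        replay_buffer.add(s, a, r, s_next, goal)             # original goal g
        # Relabel with k substitute goals drawn uniformly from the visited states
        # (e.g., the furthest point reached before battery exhaustion or a failure).
        for g_new in random.sample(visited, min(k, len(visited))):
            r_new = reward_fn(s_next, a, g_new)              # recomputed goal-conditioned reward
            replay_buffer.add(s, a, r_new, s_next, g_new)
```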
5. Experimental Setup and Results
In our study, we conducted an experimental analysis to assess the performance of our proposed hierarchical reinforcement learning framework. Our evaluation is twofold: we first verify the accuracy of the pre-trained LSTM model and then assess the effectiveness of our HRL approach. We compare our model's performance against standard soft actor-critic, soft actor-critic augmented with HER, and hierarchical actor-critic (HAC) [65] to showcase the improvements our framework offers.
5.1. Simulation Environment and Setup
Simulations were performed using MATLAB 2023, leveraging a preconfigured aerial robot model from Simscape within the UAV Toolbox. This setup allowed us to accurately replicate the flight dynamics and battery consumption of aerial robots in a simulated 400 × 400 × 50 m environment. The aerial robot's start point is kept constant, and the recharging station is located at the starting position. Missions tasked the aerial robot with reaching randomly placed survivor positions in each trial, flying at speeds of up to 20 m/s. Details of the aerial robot model can be found in Table 3.
In our simulation framework, we consider the aerial robot’s height constant during the mission to simplify the problem to a 2-dimensional space. This assumption allows us to focus on optimizing horizontal navigation and battery consumption without the added complexity of 3-dimensional movement.
5.2. Performance of LSTM Predictor
In this section, we detail the outcomes of our experiments with the LSTM-based model for predicting battery consumption. Our goal is to demonstrate the model's predictive accuracy using the dataset of [66]. The dataset contains energy usage data for 100 commercial aerial robots, documented over 195 test flights that varied in payload, velocity, and elevation. Data collection encompassed distinct attributes, capturing details from the aerial robot's current battery, GPS, and IMU. We selected the parameter set that minimized the average mean squared error (MSE), identifying it as the optimal hyperparameter configuration for our model.
Table 4 provides the hyperparameters and the corresponding evaluations. Based on these results, our analysis selected the hyperparameter set with a 0.5 dropout rate, a 0.001 learning rate, and a batch size of 128 due to its predictive accuracy, as evidenced by the lowest average RMSE and average MSE.
Figure 3 shows the performance and validation of the LSTM based on the corresponding optimal hyperparameters. It illustrates the performance of the proposed LSTM architecture in data prediction.
Figure 4 presents the LSTM model’s performance in comparison to the ground truth data, focusing on the actual aerial robot battery consumption without accounting for wind effects. The evaluation of the model’s accuracy was based on the MSE between the predicted and actual battery levels. The results reveal that the LSTM model achieves high precision on the testing dataset, exhibiting a minimal discrepancy of just 4.503 watts from the true energy consumption.
5.3. Training Result
To evaluate the efficacy of our proposed algorithm in aerial robot path planning, we designed a simulation framework within a MATLAB environment, encompassing a spatial domain of 400 × 400 × 50 m. The simulation environment is dynamically configured for each training episode, with the aerial robot’s initial position randomly chosen from the environment’s corners. This setup is further complicated by the presence of 20 cylindrical obstacles randomly distributed across the space, each with a 7 m radius and a 50 m height, to mimic navigational challenges. Also, the aerial robot should deliver variable payloads (25, 25, 50, 75 g) to four randomly positioned targets.
The high-fidelity simulated environment generates a large and diverse set of training samples by replicating various scenarios. To enhance the robustness and generalization of our model, we employ domain randomization techniques, systematically varying aspects such as target locations.
To ensure learning and convergence, we set the maximum number of training episodes to 10,000, with each episode capped at 500 time steps. The hierarchical decision-making process is governed by a temporal goal-setting interval of $c = 20$ steps, striking a balance between goal feasibility and computational time.
In all simulated environments, we assume that the aerial robot’s battery is sufficiently charged to complete deliveries to all targets without the need for interim recharging provided the operation is executed with energy efficiency. In all the environments, a safety margin of 0.5 around each obstacle is considered.
Our comparative analysis, illustrated in Table 5, Figure 5 and Figure 6, positions our algorithm against SAC, SAC integrated with HER, and HAC with SAC. Uniformity across trials was maintained by using three random seeds for each algorithm.
Figure 5 illustrates the cumulative average reward, with solid curves representing the mean performance and shaded areas highlighting the variability across different seeds. This graph underscores the superiority of our algorithm, showcasing its excellence in convergence efficiency and path planning effectiveness. The distinct advantage of our algorithm is attributed to a multifaceted approach:
Hierarchical Task Decomposition: Simplifying the complex aerial robot path planning into a structured hierarchy of sub-tasks expedites the learning process. This strategic segmentation allows for quicker adaptation and efficient resolution of navigational challenges.
LSTM-Enhanced Goal Setting: Integrating LSTM networks enables dynamic and realistic goal setting by the high-level policy, considering the aerial robot’s state and immediate environmental context. This not only accelerates convergence but also minimizes the risk of mission failure by ensuring that goals are both relevant and attainable.
Stability through Hindsight Experience Replay: The incorporation of HER plays a crucial role in enhancing the stability of the low-level controller amidst the dynamic goal adjustments from the high level. By reinterpreting past experiences under new goals, HER enables the low-level controller to maintain a consistent learning trajectory, thus significantly bolstering stability and ensuring robustness against the variability in high-level goal setting.
Adaptive Goal Flexibility: HER also aids in adapting the aerial robot’s behavior to sudden changes in goals, ensuring that the learning process remains unaffected by the high-level policy’s dynamic decisions. This adaptability is crucial for maintaining performance consistency across varied and unpredictable scenarios.
Training results show a higher success rate, accelerated convergence, and enhanced stability in aerial robot path planning tasks compared with established benchmarks.
Figure 7 shows the energy efficiency of our proposed method in the training environment in comparison to other methods. The LSTM-based prediction model anticipates future battery requirements, so our system can make informed decisions that optimize energy use. The hierarchical structure of our framework enables control over aerial robots’ missions. This allows for mission planning that inherently prioritizes energy efficiency, from the macro selection of mission objectives to the micromanagement of in-flight maneuvers. The integration of HER into our framework enhances the aerial robot’s ability to learn from past experiences, including previous energy expenditures. This learning process fine-tunes the aerial robot’s decision-making, steering it toward more energy-efficient strategies over time. Moreover, by adapting to the outcomes of past missions, our system continuously refines its understanding of energy-optimal behaviors. A critical aspect of our system’s design is the frequent update mechanism within the hierarchical decision-making process. By adjusting goals and paths at regular intervals based on real-time data and predictive insights from the LSTM model, the aerial robot can make corrections to its flight plan that avert inefficient energy use. This dynamic adaptability reduces the likelihood of scenarios that necessitate regular recharges, ensuring that the aerial robot’s energy reserves are utilized for prolonged operational periods.
5.4. Test Result and Discussion
After training, we tested the proposed algorithm against benchmark algorithms from the literature, namely SAC, SAC integrated with HER, and HAC, in the following environments. All test environments have the same dimensions as the training environment.
To evaluate the performance of each algorithm in the test environments, for each environment and algorithm, we observed the following performance metrics: the success rate, which represents the aerial robot’s ability to successfully deliver to all the targets after 2000 episodes; the collision rate, which indicates instances where the aerial robot collides with an obstacle; and the average rewards. Additionally, since the primary focus of this paper is path planning, we primarily illustrate the paths generated by the proposed method.
Table 6 presents the results for each environment, comparing our proposed method to other algorithms in terms of success rate, collision rate, and average rewards.
Environment 1: In this scenario, the aerial robot maneuvered through 50 cylindrical obstacles while delivering uniform payloads of 25 g to each of three targets, with no need to recharge. The path generated by the proposed HRL method is shown in Figure 8. The proposed HRL algorithm recorded a success rate of 90.1%. This level of performance highlights the algorithm's ability to navigate densely populated obstacle spaces with robust obstacle avoidance and energy management. Additionally, a lower collision rate of 8.1% attests to its precision in ensuring safe navigation, and the average reward metric further underscores the algorithm's overall mission efficiency. Notably, the proposed HRL algorithm maintains a high success rate even in environments laden with a greater number of obstacles, whereas the other methods exhibit a more pronounced decline in success rates. Moreover, it yields the lowest loss rate among the compared methods.
Environment 2: In this scenario, the aerial robot maneuvered through 30 cylindrical obstacles while delivering uniform payloads of 25 g to each of the three targets. As the obstacle density decreased in the second environment, the proposed HRL algorithm not only maintained but also improved its success rate to 95.3%. This improvement reflects the algorithm’s adaptability to varying environmental complexities, showcasing an enhanced ability to optimize routes and manage payloads effectively. The collision rate further dropped to 3.5%, indicating an even more pronounced advantage in navigating with safety and precision. The generated path by the proposed method in Environment 2 is shown in
Figure 9.
Environment 3: In this configuration, the experiment incorporated 30 cylindrical obstacles, maintaining a uniform payload and featuring five targets.
The introduction of more targets presented a more complex challenge, yet the proposed HRL algorithm continued to excel with a 93.6% success rate. This environment tested the algorithms’ capacity to manage additional mission objectives without compromising efficiency or safety. The proposed HRL’s success in this scenario emphasizes its effective goal prioritization and energy utilization strategies, crucial for handling multiple objectives. The generated path by the proposed method in Environment 3 is shown in
Figure 10.
Environment 4: Similar to Environment 3, this setting encompassed 30 cylindrical obstacles with a uniform payload distribution, yet with an increased target count of 7. The proposed HRL algorithm achieved an 89.4% success rate, the highest among the compared methods, demonstrating its resilience and strategic planning prowess in highly complex scenarios. Although the success rate saw a slight decline from previous environments, the algorithm’s consistent performance in terms of both collision rate and average rewards highlighted its robustness and the effectiveness of its hierarchical decision-making framework. The generated path by the proposed method in Environment 4 is shown in
Figure 11.
Based on
Table 6, the proposed HRL method consistently achieves higher success rates across all test environments, highlighting its robustness and efficiency. Specifically, it achieves a success rate of 90.1% in Environment 1, significantly outperforming SAC (59.2%), HAC+SAC (81.1%), and SAC+HER (75.4%). This trend continues in Environment 2 with a success rate of 95.3% and in Environments 3 and 4 with success rates of 93.6% and 89.4%, respectively. These results demonstrate the method’s superior adaptability and effectiveness in varied and complex scenarios.
Our method also exhibits lower collision rates compared with the benchmark algorithms. In Environment 1, the collision rate is 8.1%, compared with SAC (24.6%), HAC+SAC (9.4%), and SAC+HER (12.3%). This improvement persists across all environments, with collision rates of 3.5% in Environment 2, 3.8% in Environment 3, and 6.4% in Environment 4. These results underscore the method’s effectiveness in ensuring safe navigation and obstacle avoidance.
The average rewards achieved by our method are significantly higher in all test environments. For instance, in Environment 1, the average reward is 26.7, compared with 15.90 for SAC, 24.33 for HAC+SAC, and 21.37 for SAC+HER. This trend continues with rewards of 28.43 in Environment 2, 46.48 in Environment 3, and 62.32 in Environment 4. Higher average rewards indicate more efficient path planning and energy utilization, validating the advantages of our HRL framework with LSTM-based battery prediction.
Across all test environments, the proposed HRL algorithm consistently outperformed SAC, HAC+SAC, and SAC+HER in success rate, collision avoidance, and average rewards. This superior performance can be attributed to the algorithm’s integration of LSTM-based energy consumption prediction, adaptive goal setting, and an advanced learning mechanism that dynamically optimizes path planning and energy management in real time. The success rates of our algorithm diminished less compared with the other models under testing. Furthermore, our results indicate that our proposed algorithm achieved faster convergence with fewer steps required, an essential factor in time-sensitive SAR operations.
In the reinforcement learning literature, a success rate of approximately 90% in dynamic, obstacle-rich environments is considered highly effective [56], and the results above show that our approach reaches this level. The integration of LSTM-based energy prediction and HER for enhanced learning efficiency contributes significantly to the high performance of our method. Our approach not only demonstrates superior adaptability and efficiency but also maintains robustness across varied scenarios, which is crucial for practical applications in search and rescue missions.
6. Conclusions
This study presents a hierarchical reinforcement learning framework, enhanced with a long short-term memory-based battery consumption prediction model, for optimizing aerial robots’ operations in post-disaster scenarios. The integration of LSTM into the HRL framework has been a pivotal advancement, enabling more accurate battery usage predictions and improving decision-making processes at both high and low levels of the control hierarchy.
Our experimental results, obtained through simulations using MATLAB 2023 and a combination of Simscape and UAV Toolbox, demonstrate the superiority of our proposed framework over traditional HRL and benchmark soft actor-critic architectures. The LSTM-augmented HRL model exhibited improvements in mission success rates, energy efficiency, and adaptability to environmental changes, particularly in scenarios involving variable payloads and increased obstacle density. These enhancements are crucial for aerial robots’ operations in disaster-stricken areas where efficient resource management and flexible response to unforeseen challenges are essential.
Furthermore, the proposed framework showed a marked increase in endurance and operational efficiency, with aerial robots capable of longer flight times and reduced battery consumption. This efficiency is vital in emergency scenarios where extended aerial robot operation can be critical for successful mission outcomes. Additionally, the framework demonstrated superior collision avoidance and path planning capabilities, further underscoring its applicability in complex and unpredictable environments.