1. Introduction
Recently, the use of unmanned aerial vehicles (UAVs), commonly known as drones, has received extensive attention for several applications. The flexibility, agility, and mobility of UAVs allow them to accomplish tasks that are difficult for humans, such as monitoring, surveillance, medical supply delivery, military operations, telecommunications, rescue operations, and delivery [1,2,3]. Furthermore, as technology advances, drones gain new capabilities and become increasingly autonomous. Due to heterogeneous goals, communication technologies, and hardware/software constraints, drones in the Internet of Drones (IoD) need to collaborate to accomplish tasks that exceed their individual capabilities [4].
The efficient and safe navigation of drones to their intended destinations is hindered by various types of obstacles, both static and dynamic. The presence of these obstacles has a significant impact on the performance of drones, particularly in densely populated areas. Therefore, successfully completing missions in such environments poses a critical challenge, necessitating the development of an intelligent and efficient technique that allows UAVs to modify their flight paths in real time to ensure collision-free journeys towards their destinations. This is a particularly challenging task, especially with dense swarms and various environmental constraints [5,6].
Numerous approaches have been employed to address UAV path planning, including meta-heuristic algorithms such as the genetic algorithm (GA) [7], artificial bee colony (ABC) [8], and particle swarm optimization (PSO) [9]. However, these approaches often suffer from slow convergence and the problem of getting stuck in local optima. In response to these limitations, researchers have made efforts to enhance the optimality and convergence speed of some meta-heuristics, such as PSO, resulting in the development of improved particle swarm optimization (IPSO) [10]. Nonetheless, these approaches are primarily effective in static environments and are not suitable for unknown or dynamic environments.
Recently, the use of machine learning, including reinforcement learning, alongside other methods has enhanced the automation of drones and qualitatively advanced their ability to achieve different goals and objectives. Learning through interaction with the environment is the core concept on which reinforcement learning is based, and this is precisely what drones require for autonomous navigation [11,12,13].
Despite the massive number of studies addressing IoD path planning in the literature, most of them focus on static environments, where information about the environment is known in advance and paths are generated offline. In dynamic environments, information about the environment is not known in advance, and the path is generated online within the drone's field of view. To complete these tasks, IoDs require a sophisticated technique for autonomous flight in areas filled with static and dynamic obstacles, which remains challenging [14]. Additionally, in increasingly complex mission situations with high drone densities, drones face new challenges as they come closer to obstacles or to each other. In such situations, IoD path planning becomes a significant aspect of autonomous navigation [15,16].
Another challenge is the inherent constraints that limit the complete realization of drones' capabilities [17]. The primary limitation is the lifespan of onboard batteries, which restricts the duration of their flights. Consequently, many applications fall short of achieving their maximum potential. Furthermore, in scenarios involving extensive flight paths across large areas, minimizing the energy consumption of UAVs is essential to establish a safe route and successfully reach their destinations.
To address these challenges, we propose an energy-efficient online static and dynamic collision avoidance framework for IoD formation to tackle the multi-UAV path planning problem in a fully distributed reactive manner.
The proposed approach incorporates environmental information and introduces a novel reward structure that can be applied to various environments. The framework combines IPSO guidance obtained through an IPSO path planning algorithm with local RL-based planning. The local RL-based planner considers the surrounding environmental information, including static and dynamic obstacles, as well as other UAVs, to generate appropriate actions. The objective is to avoid potential collisions while following the fixed IPSO guidance. By integrating IPSO guidance and local RL-based planning, the framework facilitates end-to-end learning in dynamic environments. The local planner utilizes spatial and temporal information within a local area, such as the UAV’s field of view, to make real-time decisions and navigate effectively.
The proposed IPSO guidance approach ensures scalability by enabling the UAV to learn and navigate using a fixed-size learning model, even in large-scale environments. It provides a consistent reference path for the UAV while allowing the local RL-based planner to adapt and make responsive decisions based on the current environmental conditions. To achieve this, a reward function is designed that incorporates the IPSO guidance into the reward calculation.
The rest of this work is organized as follows: Section 2 reviews previous related works, covering different techniques used in path planning and collision avoidance. Section 3 describes the proposed energy-efficient online reinforcement learning for IoD path planning, thoroughly explaining the methods used and the proposed solution. Section 4 presents the simulation results and discussion. Finally, the conclusions are presented in Section 5.
3. Energy-Efficient Online Reinforcement Learning for IoD Path-Planning Description
This section provides a comprehensive overview of the system components and their respective roles. Specifically, IPSO, Q-learning, and actor–critic DRL will be presented. Additionally, the integration of these components within our proposed solution will be detailed.
3.1. Energy Model for IoD Path Planning
The energy model is essential in the path planning process. Despite their increasing popularity and numerous advantages, drones still face inherent limitations that restrict their full potential. The most significant limitation is the short lifespan of their on-board batteries, which is regarded as their primary drawback [40]. When we incorporate the energy model into the path planning and navigation algorithms, the framework can optimize the drone's trajectory to minimize overall energy consumption. By encouraging the drone to follow the IPSO path and providing higher rewards as it approaches the goal, the process enhances energy efficiency. The reward structure motivates the drone to take the most efficient route, minimizing unnecessary movements and thus conserving energy. Thus, as the drone navigates the path, the energy is calculated at each step, and an amount of reward is added depending on that energy, as given in Equation (1), which combines the cumulative reward, the current reward, and the energy E required to reach the goal or the destination. In this work, we apply the energy model of ref. [41] to accurately determine the energy consumption during a drone's flight mission. The model describes the various maneuvering activities of the drone. The energy consumption for vertical ascent over a distance $h$ at the corresponding climb velocity is calculated as shown in Equation (2). The energy required for vertical descent over a distance $h$ at the corresponding descent velocity is given by Equation (3). The energy required to hover in place from a start time to an end time is given by Equation (4). The energy of horizontal flight as a function of velocity is given by Equation (5), where $d$ is the horizontal distance and $v$ is the horizontal speed. Measurements were conducted to determine the power and time required for rotations, where the angular speed was assumed to be 2.1 rad/s and a constant power of 225 W was maintained throughout the rotation. Based on these assumptions, the energy needed to cover a given rotation angle was calculated as shown in Equation (6). Therefore, the overall energy can be written as in Equation (7).
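To make the structure of this energy bookkeeping concrete, the following sketch sums a per-maneuver energy budget in Python. It assumes each maneuver's energy can be approximated as power multiplied by time; the power constants and vertical speeds below are hypothetical placeholders, not the calibrated values of the model in ref. [41].

```python
def total_flight_energy(climb_m, descent_m, hover_s, horiz_m, turn_rad,
                        v_climb=2.0, v_descent=2.0, v_horiz=10.0,
                        p_climb=250.0, p_descent=210.0, p_hover=220.0,
                        p_horiz=230.0, p_rot=225.0, w_rot=2.1):
    """Sum the energy (J) of the maneuvering activities, assuming energy = power * time."""
    e_climb = p_climb * (climb_m / v_climb)           # vertical ascent (cf. Equation (2))
    e_descent = p_descent * (descent_m / v_descent)   # vertical descent (cf. Equation (3))
    e_hover = p_hover * hover_s                       # hovering (cf. Equation (4))
    e_horiz = p_horiz * (horiz_m / v_horiz)           # horizontal flight (cf. Equation (5))
    e_rot = p_rot * (turn_rad / w_rot)                # rotation at constant power (cf. Equation (6))
    return e_climb + e_descent + e_hover + e_horiz + e_rot   # overall energy (cf. Equation (7))
```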
3.2. Planning the Path Using Improved Particle Swarm Optimization
IPSO is a modified version of the standard PSO meta-heuristic approach. The aim is to generate the shortest obstacle-free paths for the UAVs, which has a significant effect on the drones' mission in terms of time and energy consumption. It should be noted that in our proposed solution, the IPSO algorithm is applied offline in static environments; thus, the paths generated by IPSO avoid only static obstacles during the mission. IPSO incorporates three main enhancements to standard PSO, which will be explained along with other components in the next subsections: improvements in the initialization stage of swarm particles using a chaos map, improvements in the updating strategy using an adaptive mutation strategy, and inactive particle replacement.
3.2.1. Initialization Using the Chaos Map
To improve the initial distribution of particles in the search space, the IPSO algorithm uses chaos-map logic to initialize the particle population. This initialization technique helps improve the convergence speed and the quality of the solutions. Chaos-based particles utilize the logistic map to obtain the initial formation. The simplest logistic map, presented in [10], is applied in IPSO using Equation (8):

$X_{n+1} = \mu X_n (1 - X_n), \qquad n = 0, 1, 2, 3, \ldots \qquad (8)$

where $X_n$ represents the chaos variable and $\mu$ is a predetermined constant called the bifurcation coefficient.
To effectively demonstrate the superiority of logistic map initialization over random initialization, visual representations of 10,000 iterations in MATLAB are presented in Figure 1 [42]. The graphical analysis clearly indicates that the distribution of the logistic chaos map exhibits greater uniformity compared to that of the rand function. This enhanced uniformity contributes to a broader spectrum of potential flight paths for UAVs.
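As an illustration, a minimal sketch of chaos-based swarm initialization with the logistic map of Equation (8) is shown below; the bifurcation coefficient of 4.0 (the fully chaotic regime) and the scaling of the chaos values onto the search-space bounds are our own assumptions.

```python
import numpy as np

def logistic_chaos_init(num_particles, dim, lower, upper, mu=4.0, seed=0):
    """Initialize particle positions with the logistic map X_{n+1} = mu * X_n * (1 - X_n)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.01, 0.99, size=dim)             # seed chaos variables in (0, 1)
    positions = np.empty((num_particles, dim))
    for i in range(num_particles):
        x = mu * x * (1.0 - x)                        # one logistic-map iteration per particle
        positions[i] = lower + x * (upper - lower)    # scale chaos values to the search space
    return positions

# Example: 30 particles in a 3D search space of size 8 x 8 x 4
swarm = logistic_chaos_init(30, 3, lower=np.array([0, 0, 0]), upper=np.array([8, 8, 4]))
```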
3.2.2. Adaptive Mutation Strategy
The primary goals of particles in the optimization process are searching and convergence to achieve the optimal solution. Therefore, a particle initially needs to search (explore) to enhance diversity, and convergence is then obtained in the second stage. The adaptive mutation strategy is proposed to address issues such as exploration efficiency and premature convergence. The strategy adjusts the mutation rate based on the particles' fitness values to balance the exploitation and exploration of the search space, as formulated in Equation (9), where $t$ is the current simulation time and the remaining terms denote the particle velocity and a coefficient that controls the particle's movement speed. A high value of this coefficient helps in rapid exploration of the area but may affect fine-tuned optimization. On the other hand, a low value accelerates convergence and refines the solution. Accordingly, for an effective optimization process, the particles must initially explore extensively and make significant leaps across the search regions. As the iterations progress, the speed of the particles should decrease to facilitate rapid convergence. Consequently, this coefficient should change dynamically with each iteration, as indicated by Equation (10), where its lower and upper bounds are constant values, and the current time and the total simulation time are $t$ and $MaxIt$, respectively.
3.2.3. Inactive Particle Replacement
One main enhancement of the IPSO algorithm is replacing inactive particles with new, fresh particles. Inactive particles are those that have not been able to participate effectively in the searching process. This occurs when the particles lose the ability to search globally or even locally, such as when they fail to discover a better position, leading to premature convergence. Therefore, by replacing them with fresh ones, the searching process will converge towards the global optimum rather than getting stuck in the local optimum. Additionally, the algorithm incorporates other crucial components that contribute to its overall functionality and efficacy. These components will be highlighted to provide a comprehensive understanding of the IPSO algorithm.
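One simple way to realize this replacement rule is sketched below: a per-particle stagnation counter plays the role of the particle's trail, and particles whose personal best has not improved for a threshold number of iterations are re-initialized. The threshold value and the reset policy are illustrative assumptions.

```python
import numpy as np

def replace_inactive(positions, velocities, pbest_fitness, stagnation,
                     lower, upper, threshold=10, rng=None):
    """Re-initialize particles whose personal best has stagnated for `threshold` iterations."""
    rng = rng or np.random.default_rng()
    for i, stalled in enumerate(stagnation):
        if stalled >= threshold:                          # particle is considered inactive
            positions[i] = rng.uniform(lower, upper)      # fresh random position
            velocities[i] = np.zeros_like(velocities[i])  # reset velocity
            pbest_fitness[i] = np.inf                     # forget the stale personal best (minimization)
            stagnation[i] = 0                             # restart its trail counter
    return positions, velocities, pbest_fitness, stagnation
```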
3.2.4. Social Acceleration Coefficient
A crucial role is played by the acceleration coefficients in the IPSO algorithm, as they indicate the weight of the stochastic acceleration. By multiplying these coefficients, c1 and c2, with the random vectors r1 and r2, they introduce controllable stochastic effects on the particle velocities of the IoD formation. Furthermore, the coefficients represent the weights assigned to the information shared between particles. For example, if c1 = c2 = 0, the particle relies solely on its own knowledge. Conversely, if c1 > c2, the particles tend to gravitate towards their local attractors (pBest), while if c2 > c1, the particles lean towards the global attractor (gBest). Thus, the choice of c1 and c2 determines the balance between exploration and exploitation within the search space.
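For reference, the standard PSO velocity and position update in which c1, c2, r1, r2, and the inertia weight appear can be sketched as follows; IPSO retains this structure while adapting the coefficients as described in the surrounding subsections.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One standard PSO update: inertia + cognitive (pBest) + social (gBest) terms."""
    rng = rng or np.random.default_rng()
    r1 = rng.random(x.shape)                  # stochastic weight of the cognitive pull
    r2 = rng.random(x.shape)                  # stochastic weight of the social pull
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v_new, v_new
```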
3.2.5. Inertia Weight ($\omega$)
The optimization algorithm's performance depends on balancing global and local searching. To accomplish this, the concept of the inertia weight $\omega$ is employed to effectively manage exploitation and exploration in the search process. To provide further clarity, a large value of $\omega$ promotes exploration by encouraging the algorithm to search a wider space, while a small value enables exploitation by focusing on promising regions. As a result, some studies have proposed adapting $\omega$ linearly, as demonstrated in [10]. The linear adaptation of $\omega$ is presented in Equation (11):

$\omega(t) = \omega_{max} - \dfrac{\omega_{max} - \omega_{min}}{MaxIt}\, t \qquad (11)$

Here, $t$ represents the current time, $MaxIt$ is the maximum simulation time, and $\omega_{min}$ and $\omega_{max}$ correspond to the minimum and maximum values of the inertia weight. By utilizing this linear adaptation, the value of $\omega$ changes dynamically over time, gradually transitioning from $\omega_{max}$ to $\omega_{min}$ as the simulation progresses. This adaptive approach allows the algorithm to strike the necessary balance between exploitation and exploration throughout the optimization process. In light of the aforementioned considerations, the IPSO algorithm can be summarized as follows. It starts by initializing all parameters and creating the initial position, velocity, and solution of all particles. After that, the path planning process is carried out by using these parameters and the environment constraints to return feasible paths to the destination with the corresponding average fitness. The improvements described above are added to standard PSO to increase search efficiency. Furthermore, as discussed previously, adaptive adjustment of the inertia weight and the speed-control coefficient occurs at each iteration. At the end of each iteration, the trail of each particle is evaluated; if it exceeds the threshold, the inactive particle is replaced with a refreshed one. The algorithm then repeats until an optimal solution is attained or the time limit is reached.
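Putting these pieces together, a high-level skeleton of the IPSO loop described above might look like the following sketch, which reuses the initialization, update, and replacement helpers sketched earlier; the fitness function, search bounds, and trail threshold are assumptions, and the adaptive mutation step of Section 3.2.2 is omitted for brevity.

```python
import numpy as np

def ipso(fitness, lower, upper, num_particles=30, dim=3, max_it=200,
         w_max=0.9, w_min=0.4, c1=1.5, c2=1.5, trail_threshold=10):
    """Skeleton of the IPSO loop: chaos init, adaptive inertia, update, inactive replacement."""
    rng = np.random.default_rng(0)
    x = logistic_chaos_init(num_particles, dim, lower, upper)       # Section 3.2.1
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_fit = np.array([fitness(p) for p in x])
    gbest = pbest[np.argmin(pbest_fit)]
    stagnation = np.zeros(num_particles, dtype=int)

    for t in range(max_it):
        w = w_max - (w_max - w_min) * t / max_it                    # Equation (11)
        for i in range(num_particles):
            x[i], v[i] = pso_step(x[i], v[i], pbest[i], gbest, w, c1, c2, rng)
            x[i] = np.clip(x[i], lower, upper)
            f = fitness(x[i])
            if f < pbest_fit[i]:                                    # improvement: reset the trail
                pbest[i], pbest_fit[i], stagnation[i] = x[i].copy(), f, 0
            else:
                stagnation[i] += 1                                  # trail of the particle grows
        gbest = pbest[np.argmin(pbest_fit)]
        x, v, pbest_fit, stagnation = replace_inactive(
            x, v, pbest_fit, stagnation, lower, upper, trail_threshold, rng)  # Section 3.2.3
    return gbest
```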
3.3. Q-Learning Algorithm
The Q-learning algorithm is an RL algorithm that focuses on how an agent can learn from the environment by making sequential decisions that maximize the cumulative rewards. RL has undergone significant advancements and refinements. Q-learning is a model-free, off-policy RL method. The Q-learning agent is a value-based agent trained to estimate the return, or future rewards, during the learning process. Thus, the agent selects and outputs the action for which the greatest return will be obtained, using the concept of a Q-value. The Q-values are the contents of the Q-table, whose rows and columns represent the states and actions. Initially, all Q-values are random or zero. Then, the Bellman equation is employed to update the Q-values, and the maximum Q-value, representing the best action, is selected for a particular state. Thus, Q-learning, as an RL algorithm, addresses the sequential decision-making problem, which is formalized by the Markov decision process (MDP). The Bellman equation is formulated using the MDP and the value function. In the next subsection, a detailed explanation is provided for the Markov decision process, the value function, and the Bellman equation, which form the foundation of RL algorithms.
3.3.1. Markov Decision Process (MDP)
The MDP captures the idea of the agent interacting with the environment over several time steps. At each time step, the agent takes an action A after observing the current state of the environment S and accordingly obtains a reward R [43,44]. The agent's environment is probabilistic, meaning that after an action is taken, the state transition and the resulting reward are random. Policies are selections of the actions that should be performed in a specific state, which can be formulated using the MDP [45]. The MDP consists of the following components:
State: The environment's state at time step t is denoted by St. The agent has access to a set of observable states, denoted as S, which refers to the observations made by the agent regarding its own situation [46].
Action: The action refers to a set of possible actions, denoted as A, within a given state S. At is the action taken at time step t. Typically, the agent's available actions remain the same across all states; thus, a single representation can be used to denote the set of actions A [47].
Policy: A policy in RL can be defined as the behavior of the agent at a given time. In other words, the policy maps the perceived states to the actions to be taken in those states. Determining a policy can be a complicated search process that requires significant computational resources, or it can be as simple as a lookup table or a small function. The policy is an essential part of RL, in the sense that the agent cannot determine its behavior without it. In general, a policy specifies the probabilities for each action [12]. The policy $\pi(a \mid s)$ in Equation (12), $\pi(a \mid s) = P(A_t = a \mid S_t = s)$, is the probability that the agent chooses an action a in a state s at time step t.
Reward: The goal of an RL problem can be defined by the reward signal. The environment gives the RL agent a numerical value called the reward. Therefore, the agent's role is to try to maximize the total reward that it receives over the running time. Thus, the reward signal R defines which of the actions the agent can make are good and which are bad.
Discount factor: The discount factor, or discount rate, is denoted by $\gamma$. Its value lies between 0 and 1, determining the importance of future rewards compared to immediate rewards. In each state, when the agent takes an action, it receives a reward as compensation [12]. The discount factor allows the agent to balance immediate and delayed rewards, enabling it to make decisions that align with either short-term or long-term objectives based on the chosen value of $\gamma$. The choice of discount rate depends on the specific problem and the agent's goals. A lower discount rate may be suitable for situations where immediate rewards are of importance, while a higher discount rate may be more appropriate when long-term cumulative rewards are the primary focus [48].
Value Function: The value function is a key concept in RL. It provides the prediction of the expected future reward that can be achieved if the agent follows a particular policy, and it measures how well each state or state–action pair performs [49] (the standard expressions are given after this list). The value function is linked to the Bellman equation, which describes the relationship between the value of a state and the values of its neighboring states [50].
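For completeness, these components can be summarized with the standard RL expressions for the discounted return and the value functions (general notation, not specific to this framework):

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right], \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right].$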
3.3.2. Bellman Equation
Q-learning algorithms use the Bellman equation as an exploitation–exploration strategy to update the Q-values and improve the policy over time. Q-learning iteratively updates the Q-values, which represent the expected cumulative reward for taking a particular action in a given state. After using the Bellman equation to derive the Q-values, the best action is selected based on the maximum Q-value for a particular state [50]. Equation (13) is the simplified form used to obtain the Q-value [45], while Equation (14) applies the temporal-difference update:

$Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s',a') \qquad (13)$

$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] \qquad (14)$
Table 2 below provides more clarity about the equations.
3.4. Deep Reinforcement Learning (DRL)
After providing an overview of Q-learning, one of the traditional RL algorithms, we explore the advancements made in the field of deep reinforcement learning (DRL). DRL combines the principles of RL with the powerful function approximation capabilities of deep neural networks, allowing for efficient learning in complex, high-dimensional environments. A key advantage of DRL is its ability to handle large state and action spaces, such as those arising in path planning in a 3D environment, which are often present in real-world problems. Deep neural networks can effectively learn to represent the value function or the policy, enabling the agent to make informed decisions in complex environments [51]. An alternative approach in DRL is the actor–critic model, which combines policy-based and value-based methods to utilize the strengths of both. The next subsection provides a detailed explanation of the actor–critic model.
Actor–Critic Model
The actor–critic model is a method that combines the strengths of value-based methods, such as Q-learning, and policy-based methods, such as policy gradients, to help the agent learn both the value function and the policy. This approach can often lead to more stable and efficient learning, as the critic provides valuable guidance to the actor. The actor can focus on exploration and policy optimization, while the critic provides a stable and informative value function to guide the actor's updates [12,51].
Neural Networks: When dealing with complex problems, the Q-table used in reinforcement learning needs to store a very large number of states and actions. This can significantly slow down training and reduce the overall effectiveness of the approach. Path planning and obstacle avoidance are examples of complex problems of this nature. To address this, a neural network is used to calculate values rather than relying solely on the Q-table. By using DRL, the system can handle the large number of states and actions required for the complex obstacle avoidance problem without the performance issues associated with a Q-table of that scale [27]. In DRL, neural networks are used as function approximators for both the policy and the value function; the specific architectures can vary depending on the problem domain [52]. The actor–critic model involves two main components: the actor network, which learns the policy and selects actions, and the critic network, which learns the value function and evaluates the selected actions.
Experience replay memory: Experience replay memory is a popular technique used in deep Q-learning (DQL). It helps to reduce the correlation between samples, which improves the efficiency of the learning process [52], and it addresses several limitations of DQN. Experience replay memory refers to the storage of the agent's past experiences (states, actions, rewards, next states) in a replay buffer. During training, the agent periodically samples a batch of experiences from this buffer and uses them to update the neural network parameters, rather than using only the most recent experience. The main benefit of such memory is that sampling experiences randomly from the buffer breaks the correlation between consecutive samples that arises during learning, which leads to more stable and efficient learning. The reuse of past experiences also allows the agent to learn more from a given number of interactions with the environment. Additionally, replaying past experiences helps the agent retain knowledge about earlier parts of the task, preventing it from forgetting important information. Therefore, the combination of the actor–critic method and experience replay can lead to powerful and data-efficient training of the agent [51,52].
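A minimal replay buffer of the kind described above can be sketched as follows; the capacity and batch size are illustrative choices, not the values used in our experiments.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest experiences are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)   # random sampling breaks sample correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```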
3.5. Proposed IPSO-DRL System Model
In our proposed solution, IPSO and RL-based methods are integrated to form the IPSO-RL framework. The system model of the proposed framework is illustrated in Figure 2. The IPSO and DRL components are described above. The construction of this framework considers the surrounding environmental information, including static and dynamic obstacles as well as other UAVs, to generate appropriate actions while accounting for energy consumption. One key feature of our proposed framework is scalability: by utilizing the IPSO guidance, the UAV can learn and navigate using a fixed-size learning model, even in large-scale environments. By incorporating IPSO guidance, the framework maintains its efficiency and effectiveness in complex environments. The RL-based planner utilizes spatial and temporal information within a local area, such as the UAV's field of view (FoV), as shown in Figure 2, to make real-time decisions and navigate effectively. To drive the learning-based planning process, a reward function is designed that incorporates the IPSO guidance into the reward calculation, motivating the UAVs to explore various potential solutions while encouraging convergence in the learning-based planning process. The FoV observation and the IPSO guidance are used as inputs to a CNN that learns from the current observation, the IPSO guidance, and historical information to predict an appropriate action.
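As an indicative sketch only (not the exact architecture used in this work), the FoV observation and the IPSO guidance could be fused in an actor–critic network as follows, written here in PyTorch with an assumed 5 × 5 FoV grid, a three-dimensional guidance vector, and six discrete actions.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """CNN encoder for the local FoV plus IPSO guidance, with actor and critic heads."""

    def __init__(self, fov_channels=1, fov_size=5, guidance_dim=3, num_actions=6):
        super().__init__()
        self.encoder = nn.Sequential(                   # encodes the FoV occupancy grid
            nn.Conv2d(fov_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten())
        enc_dim = 32 * fov_size * fov_size
        self.actor = nn.Sequential(                     # policy head: action probabilities
            nn.Linear(enc_dim + guidance_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions))
        self.critic = nn.Sequential(                    # value head: state-value estimate
            nn.Linear(enc_dim + guidance_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, fov, guidance):
        z = torch.cat([self.encoder(fov), guidance], dim=-1)   # fuse FoV features with guidance
        return torch.softmax(self.actor(z), dim=-1), self.critic(z)
```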
The process begins by initializing the environment, which involves creating a random map containing static obstacles, dynamic obstacles, and the drones, which represent the RL agents. Next, the start and end coordinates are established. The energy-efficient path planning conducted using the IPSO algorithm ensures minimum energy consumption and avoids static obstacles. The IPSO path serves as a consistent reference for the drone, allowing the RL-based planner to adapt and make responsive decisions based on real-time environmental conditions. After the path is established, the drone navigates the environment using an RL-based planner (in our case, Q-learning or DRL) to reach its goal by generating appropriate actions. The drone receives positive rewards for following the IPSO path, making this path preferable to others. Conversely, if the drone collides with any obstacle or goes outside the boundary, it incurs a negative reward, or penalty. Additionally, the drone earns positive rewards as it follows the IPSO path or gets closer to the goal, encouraging its learning process. The highest reward is given when the drone successfully reaches the goal, to reinforce the correct completion of the task. Energy consumption is a critical factor in drone path planning; therefore, we provide higher rewards when the drone follows the IPSO path, encouraging the drone to use it. Also, at each step, the drone receives a reward that corresponds to its progress towards the goal. This process leads to enhanced energy efficiency.
Figure 3 shows the flowchart of the IPSO-RL framework.
3.5.1. Reward Function Structure
The reward function of the proposed approach is designed to manage the movement of the drones within the environment by calculating a reward that helps the agent navigate around obstacles effectively and reach the destination. To accomplish this, this part of the proposed solution is organized into functions and helper methods that collectively realize the reward operation. The helper methods include a method for calculating distances and another that returns rewards from a dictionary based on different cases. While navigating toward the goal, the drone will encounter several cases, such as obstacles or the IPSO path. These cases are carefully identified and processed to ensure the best learning for the drone, and they are treated in the following sequence with the related rewards and processes (a code sketch of this logic follows the list):
If the drone reaches the goal, a large positive reward is defined.
If the drone collides with obstacles or goes out of bounds, a large negative reward is defined.
If the drone reaches one of the locations on the IPSO path, a large positive reward is defined, along with additional rewards for being near the goal. This encourages the drone to follow the IPSO guidance.
If the drone reaches a free cell that is not part of the IPSO guidance, a small negative reward is defined, which discourages deviation from the reference path. This small negative reward also varies depending on the distance from that point to the goal.
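A simplified sketch of this case-based reward logic is given below; the numerical reward values and the distance-based shaping term are illustrative assumptions rather than the exact values used in our experiments.

```python
import math

def inside(position, bounds):
    """True if the 3D cell lies within the environment boundary (bounds = (X, Y, Z))."""
    return all(0 <= p < b for p, b in zip(position, bounds))

def step_reward(position, goal, ipso_path, obstacles, bounds):
    """Case-based reward for one drone step, following the sequence above (illustrative values)."""
    dist_to_goal = math.dist(position, goal)
    if position == goal:
        return 100.0                                    # case 1: goal reached
    if position in obstacles or not inside(position, bounds):
        return -100.0                                   # case 2: collision or out of bounds
    if position in ipso_path:
        return 10.0 + 5.0 / (1.0 + dist_to_goal)        # case 3: on the IPSO path, closer is better
    return -1.0 - 0.1 * dist_to_goal                    # case 4: free cell off the reference path
```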
After setting up the environment and implementing the necessary functions, the training process commences using two distinct approaches: Q-learning and actor–critic DRL.
3.5.2. Q-Learning Approach
For the Q-learning approach, we trained the drone by iteratively updating the Q-values based on the agent’s interactions with the environment. This required the agent to observe the current state, select an action, receive a reward, and then update the Q-value estimate accordingly. The goal of this training was to enable the agent to learn the optimal action–value function, which would guide the drone’s decision-making process. Algorithm 1 presents the pseudocode that we use in our approach.
In contrast, we use the actor–critic model as our DRL approach, utilizing a more sophisticated neural network architecture. This method consists of two main components: the actor network and the critic network. The actor network is responsible for selecting the most appropriate actions based on the current state, while the critic network evaluates the quality of those actions and provides feedback to the actor network. Through this iterative process of action selection and value estimation, the drone is able to learn a more complex and effective policy for navigating the environment. Algorithm 2 is the pseudocode that explains the general structure of the training process of the IPSO-DRL framework.
Algorithm 1: Pseudocode of Q-learning of the IPSO-RL framework
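As a hedged illustration of the structure that Algorithm 1 describes (not a verbatim reproduction of it), a tabular Q-learning loop with epsilon decay could be sketched as follows, assuming a hypothetical environment interface with reset() and step() and the reward cases of Section 3.5.1.

```python
import numpy as np

def train_q_learning(env, num_episodes=250_000, alpha=0.1, gamma=0.99,
                     eps_start=1.0, eps_min=0.05, eps_decay=0.9999):
    """Tabular Q-learning with epsilon decay over a discretized 3D environment (hypothetical env API)."""
    q_table = np.zeros((env.num_states, env.num_actions))
    epsilon = eps_start
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            if np.random.rand() < epsilon:                  # explore
                action = np.random.randint(env.num_actions)
            else:                                           # exploit the learned Q-values
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(action)     # reward follows Section 3.5.1
            td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
            q_table[state, action] += alpha * (td_target - q_table[state, action])  # cf. Equation (14)
            state = next_state
        epsilon = max(eps_min, epsilon * eps_decay)         # epsilon decay (Section 4.1)
    return q_table
```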
Algorithm 2: Pseudocode of actor–critic of the IPSO-DRL framework
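Similarly, a compact one-step actor–critic update, sketched under the assumption of the ActorCriticNet outlined above and the same hypothetical environment interface, could look like the following; it is not a reproduction of Algorithm 2.

```python
import torch

def actor_critic_update(net, optimizer, fov, guidance, action, reward,
                        next_fov, next_guidance, done, gamma=0.99):
    """One actor-critic step: the critic's TD error drives both the value and policy losses.
    `action` is a LongTensor of shape (batch, 1); `reward` and `done` are float tensors."""
    probs, value = net(fov, guidance)
    with torch.no_grad():
        _, next_value = net(next_fov, next_guidance)
        td_target = reward + gamma * next_value * (1.0 - done)
    td_error = td_target - value                             # critic's evaluation of the taken action
    critic_loss = td_error.pow(2).mean()
    log_prob = torch.log(probs.gather(-1, action) + 1e-8)    # log-probability of the taken action
    actor_loss = -(log_prob * td_error.detach()).mean()      # policy gradient weighted by the TD error
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
    return float(td_error.mean())
```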
4. Simulation Results and Discussion
To evaluate our proposed framework for planning the path and avoiding obstacles with minimum energy consumption, we carried out a simulation in a dynamic environment.
In the following subsections, we explain the relevant parameters and discuss the results.
4.1. Parameter Settings
The hardware and software specifications are listed in Table 3.
We use the epsilon decay technique to balance the agent's exploration and exploitation. The parameter settings are listed in Table 4.
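For clarity, the epsilon decay schedule behaves as in the sketch below; the decay rate and minimum value are illustrative, with the actual settings listed in Table 4.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay_rate=0.999):
    """Exponentially decay exploration: explore heavily at first, keep a small exploration floor."""
    return max(eps_min, eps_start * decay_rate ** episode)

# Early episodes explore almost at random; later ones mostly exploit the learned policy:
# decayed_epsilon(0) -> 1.0, decayed_epsilon(1000) -> ~0.37, decayed_epsilon(10000) -> 0.05
```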
The environment is simulated with static obstacles and dynamic objects, where drones navigate to plan the path. Our environment is 8 × 8 × 4. It contains three static obstacles and two dynamic objects, in addition to our drones, which are the RL agents. The speed is assumed to be 10 m/s for the drones and 1 m/s for dynamic obstacles.
Finally, Table 5 lists the parameters used in training the DRL approach.
4.2. Simulation Results
In this section, we present the simulation results for our proposed IPSO-RL framework. We conducted several comparative analyses to demonstrate the validity and effectiveness of the proposed method. First, we examined the outcomes of integrating the IPSO algorithm with the Q-learning method and analyzed the effects of incorporating the energy model into the framework by comparing the results with and without the energy model. Second, we further expanded our analysis by comparing the performance of our IPSO-RL framework against the DRL approach. This comparison helps to evaluate the advantages and trade-offs of using the IPSO-RL framework versus a more advanced DRL technique. Finally, we compared our findings from the IPSO-DRL framework against a recent study in the field. This comparative analysis provides valuable insights into how our proposed IPSO-DRL framework fares against state-of-the-art techniques in the literature.
4.2.1. IPSO-RL Results
To achieve optimal performance, we examined various sets of parameters across 50,000 episodes. When we identified promising results, we extended the range and conducted additional training with those parameters to confirm their efficacy.
Figure 4 depicts the drone's performance during the training process. We experimented with different exploration rate ($\epsilon$) values of 0.99, 0.5, and 0.1, in addition to the decaying technique. Moreover, we tested learning rate ($\alpha$) values of 0.0001, 0.001, 0.01, and 0.1 and discount factor ($\gamma$) values of 0.99 and 0.
4.2.2. DRL Results
First, we used our proposed framework to train the drone for 250,000 episodes using Q-learning under two conditions, one with the energy model included and one without. By monitoring the reward amount during training, we are able to observe the learning progress and the differences between the two approaches. When the energy model was included, we observed a significant improvement in the drone’s performance after approximately 15,000 episodes. The cumulative reward steadily increased, indicating continuous learning and enhancement, and it became relatively stable after approximately 30,000 episodes, indicating that the drone had reached an optimal level of learning and performance.
Figure 5 shows a zoomed-in view of the learning process to visualize the interval where the gradual increase in reward occurs, while Figure 6 shows the cumulative reward obtained by the drone throughout the whole learning process. The reward amounts shown in the figures indicate that the drone was able to learn to operate in a more energy-efficient manner.
Figure 7 shows the energy consumption during the learning process and how it decreases significantly as the number of episodes increases, indicating that the drone is able to learn and optimize its energy usage over time.
In contrast, without the energy model, the improvement in performance was much slower and was noticed only after approximately 150,000 episodes.
Figure 8 shows where the exponential rise of the learning curve starts: it begins after approximately 135,000 episodes and reaches its optimum after 155,000 episodes.
Figure 9 shows the cumulative reward curve without using the energy model for all episodes.
We also analyzed the average reward curve, which provides an overall measure of the drone's learning progress. With the energy model included, the moving average reward curve showed clear improvements over time, as depicted in Figure 10, demonstrating that the drone is able to learn more effective and comprehensive behaviors.
Without the energy model, the improvement in the moving average reward curve was more gradual and less apparent in comparison, as illustrated in Figure 11.
Next, we compared the performance of the Q-learning RL (IPSO-RL) approach with the actor–critic DRL (IPSO-DRL) approach. We can notice the rapid improvement of the IPSO-DRL model, shown in Figure 12, compared with the IPSO-RL model. Also, the upward trend in the average reward over the training indicates that the DRL model can learn more effectively within fewer episodes.
In contrast, the IPSO-RL plot (Figure 6) displays a more gradual improvement in the reward curve. While the IPSO-RL approach still demonstrated a positive learning trend, the magnitude of the rewards achieved was generally lower compared to the IPSO-DRL.
This difference in performance can be attributed to the inherent strengths of DRL techniques, such as their ability to effectively capture complex relationships and patterns in high-dimensional state spaces. The use of deep learning in the IPSO-DRL approach allows the agent to learn more expressive and efficient representations of the problem, leading to faster convergence and higher-quality solutions.
Finally, we compared our IPSO-DRL results with a recent study. The study suggests using an improved DRL technique to solve the path planning problem in dynamic environments, employing the Q-function approximation of the prioritized experience replay D3QN to estimate the agent's action Q-values. The algorithm's network structure combines a competitive (dueling) neural network architecture with double Q-learning and prioritized experience replay. The simulation of the study was implemented using the TensorFlow platform. The environment includes static obstacles that the drone is trained to avoid; after that, the same model is used to train the drone with dynamic obstacles. The dynamic obstacles are added randomly to the environment, and their movements are randomly generated as well. The reward function penalizes collisions with obstacles and rewards progress towards the goal. The start and goal positions for the drone are randomly generated within the environment.
Figure 13 depicts the reward curve for the study. We can notice the improvement in the DRL technique for dynamic path planning. However, the reward curve depicted in these results, although improving over time, stabilizes at a lower cumulative reward compared to our framework. We can notice that our framework is able to reach a much higher level of cumulative reward compared to the previous study, highlighting a superior learning capability of the IPSO-DRL framework.
Through these comprehensive comparisons, we aimed to thoroughly assess the merits of the IPSO-RL framework and its ability to address the complexities involved in the problem at hand.
4.3. Discussion
This study highlights that the proposed IPSO-DRL framework generates online energy-efficient paths for IoDs in dynamic environments significantly more effectively than existing methods. Furthermore, the proposed IPSO-RL framework successfully acquires a feasible and effective route with minimum energy consumption. This can be attributed to the adaptive learning capabilities of the solution, which integrates IPSO with RL. A significant improvement in the drone's performance, with the cumulative reward steadily increasing when the path is optimized to minimize energy consumption, indicated continuous learning and enhancement and suggests that the drone reached an optimal level of learning and performance. The energy consumption during the learning process decreased significantly as the number of episodes increased, indicating that the drone learned to optimize its energy usage over time. Moreover, the performance comparison of Q-learning with the actor–critic DRL model showed a much earlier improvement for IPSO-DRL. The superior performance of the IPSO-DRL model is attributed to its ability to capture complex relationships in high-dimensional state spaces through deep learning, leading to faster convergence and higher-quality solutions. Furthermore, our IPSO-DRL framework outperforms the recent study on dynamic path planning, demonstrating superior learning capabilities and achieving higher cumulative rewards. Unlike previous approaches that rely on static environments, our method dynamically adjusts paths in response to surrounding environmental changes, resulting in a more efficient solution. These findings suggest that implementing IPSO-RL in IoD systems leads to more sustainable and cost-effective operations, especially in energy-constrained environments.
The main limitation of our study is that it assumes all UAVs fly at a fixed altitude. This choice was made to ensure a consistent resolution in applications that require the collection of useful information from the environment. Thus, the framework cannot currently handle IoD formations at different altitudes.
5. Conclusions
In this work, we proposed a novel approach for planning collision-free paths for UAVs operating in unknown, dynamic, 3D environments. By integrating the improved particle swarm optimization (IPSO) algorithm with reinforcement learning (RL), the developed IPSO-RL framework could generate energy-efficient and collision-free paths for drones in real time. The proposed framework offers a flexible and adaptive solution for UAV path planning, capable of responding to dynamic environmental changes, thus addressing the limitations of traditional heuristic methods designed for static environments. Additionally, incorporating an energy model into the RL reward structure allowed the drone to optimize the energy consumption of its paths, leading to more sustainable and enduring drone operations. We conducted simulations to evaluate our proposed solution, training the drone for 250,000 episodes using Q-learning and actor–critic algorithms within the IPSO-RL framework. We tested the framework under two conditions: with and without the energy model included. The extensive simulations highlighted the superior performance of the IPSO-DRL approach compared to other benchmarks in terms of collision avoidance, path length, and energy consumption. By addressing the critical challenges of collision avoidance and energy efficiency, this research advances the state of the art in autonomous UAV navigation. The proposed framework represents a significant step towards enabling safe and sustainable UAV operations in complex, dynamic environments.
In future work, we will consider variable altitudes of IoDs, which will require a more sophisticated reward function. We will also optimize the UAV system to incorporate weather conditions and other constraints.