Reinforcement Learning-Based Formation Pinning and Shape Transformation for Swarms
Abstract
1. Introduction
2. Model Description and Problem Formulation
2.1. Creation of Boids Model
2.2. Virtual Leader-Based Pinning Algorithm
- Initialization: Set the initial positions and velocities of all drones in the swarm, ensuring a suitable distribution across the workspace, and define the desired trajectory or path for the swarm to follow, taking into account any mission-specific objectives or constraints. We also determine the parameters for communication and coordination among the drones, such as the range and frequency of the wireless links and the method for exchanging information between drones; these parameters enable practical cooperation and synchronization within the swarm.
- Virtual Leader Update: Assume that N drones form a swarm, where the position of drone i is denoted by $p_i$ and the position of the virtual leader by $p_v$. The pinning (traction) algorithm calculates the position of the virtual leader using the following formula [29]: $p_v = \sum_{i=1}^{N} w_i p_i / \sum_{i=1}^{N} w_i$, where $w_i \geq 0$ is the weight assigned to drone i. The formula computes the position of the virtual leader as a weighted average of the coordinates of all drones. By continuously updating the position of the virtual leader, the other drones can adjust their behavior based on its motion, enabling coordinated movement within the swarm.
- Communication and Coordination: We need an information exchange mechanism that allows the drones to adjust their trajectories and align with the virtual leader. One commonly used method is the distributed average consensus algorithm, which enables the drones to converge towards a common trajectory by iteratively updating their own estimates based on information received from neighbors. Each drone maintains a local estimate of the desired trajectory, denoted as $\hat{x}_i(k)$, where i represents the index of the drone and k denotes the iteration step. The update equation for each drone's local estimate can be expressed as $\hat{x}_i(k+1) = \hat{x}_i(k) + \sum_{j \in N_i} w_{ij}\,\bigl(\hat{x}_j(k) - \hat{x}_i(k)\bigr)$. Here, $N_i$ represents the set of neighboring drones of drone i, $w_{ij}$ represents the weight associated with the communication link between drone i and drone j, and the term $\hat{x}_j(k) - \hat{x}_i(k)$ represents the difference between the trajectory estimates of drone j and drone i. The weights $w_{ij}$ can be determined by different criteria, such as distance, connectivity strength, or predefined values; common strategies include uniform weights, distance-based weights, and dynamically adjusted weights. By repeatedly applying this update, each drone reconciles its estimate with those of its neighbors, and the swarm converges towards a common trajectory.
- Trajectory Adjustment: Let $p_i$ be the position of drone i and $p_v$ be the position of the virtual leader. The desired direction towards the virtual leader is $d_{\text{lead}} = (p_v - p_i)/\lVert p_v - p_i \rVert$. Considering the influence of neighboring drones, where each drone tries to avoid collisions while coordinating its movement, let $p_j$ be the position of a neighboring drone; the repulsion from that neighbor is $d_{\text{rep}} = (p_i - p_j)/\lVert p_i - p_j \rVert$. The overall direction is obtained by combining the desired direction towards the virtual leader with the repulsion from neighboring drones, for example as a weighted sum whose weights reflect the relative importance of aligning with the virtual leader versus avoiding collisions (a code sketch of this pipeline follows the list).
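The Python sketch below illustrates, under stated assumptions, how the three ingredients above (virtual-leader update, distributed average consensus, and trajectory adjustment) could fit together. It is a minimal illustration rather than the authors' implementation; the helper names (`virtual_leader`, `consensus_step`, `steer`), the uniform default weights, and the gains `k_lead` and `k_rep` are assumptions chosen for readability.

```python
import numpy as np

def virtual_leader(positions, weights=None):
    """Weighted average of drone positions (assumed form of the pinning formula)."""
    n = len(positions)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    return (w[:, None] * positions).sum(axis=0) / w.sum()

def consensus_step(estimates, neighbors, w):
    """One distributed-average-consensus update of the local trajectory estimates.

    estimates: (N, d) array, row i is drone i's current estimate.
    neighbors: dict mapping i -> iterable of neighbor indices.
    w:         dict mapping (i, j) -> link weight.
    """
    new = estimates.copy()
    for i, nbrs in neighbors.items():
        for j in nbrs:
            new[i] += w[(i, j)] * (estimates[j] - estimates[i])
    return new

def steer(p_i, p_leader, neighbor_positions, k_lead=1.0, k_rep=0.5, eps=1e-9):
    """Combine attraction to the virtual leader with repulsion from neighbors."""
    d_lead = (p_leader - p_i) / (np.linalg.norm(p_leader - p_i) + eps)
    d_rep = np.zeros_like(p_i)
    for p_j in neighbor_positions:
        diff = p_i - p_j
        d_rep += diff / (np.linalg.norm(diff) + eps)
    direction = k_lead * d_lead + k_rep * d_rep
    return direction / (np.linalg.norm(direction) + eps)

# Toy usage: three drones in the plane.
pos = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
leader = virtual_leader(pos)
heading_0 = steer(pos[0], leader, [pos[1], pos[2]])
```

In practice the attraction and repulsion gains would be tuned by hand; Section 3 describes how analogous cohesion and repulsion parameters are instead selected by the learning agent.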
3. Value-Based Reinforcement Learning Methods
- State Space: The state space represents the current state of the swarm system, which includes relevant information about the environment and the swarm’s configuration. It can include parameters such as the positions and velocities of individual drones, the position and velocity of the virtual leader, the distances between drones, and any other relevant variables.
- Action Space: The action space represents the available actions that the swarm system can take in each state. In this case, the action space consists of different combinations of the cohesion parameter and the repulsion parameter; each combination represents a candidate configuration for the swarm system.
- Reward Function: The reward function quantifies the desirability of a particular state–action pair, providing feedback to the swarm system on how good or bad its actions are (a small encoding sketch follows this list).
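As a concrete illustration of the state and action spaces described above, the sketch below discretizes a simple swarm state and enumerates actions as (cohesion, repulsion) pairs. The bin size, the candidate parameter values, and the choice of state features are assumptions made for the example, not the paper's exact definitions.

```python
import itertools
import numpy as np

# Assumed candidate values for the cohesion and repulsion parameters.
COHESION_VALUES = [0.5, 1.0, 1.5]
REPULSION_VALUES = [0.5, 1.0, 1.5]

# Action space: every (cohesion, repulsion) combination.
ACTIONS = list(itertools.product(COHESION_VALUES, REPULSION_VALUES))

def encode_state(positions, leader_pos, bin_size=1.0):
    """Discretize the swarm configuration into a hashable state.

    Uses the mean distance to the virtual leader and the mean inter-drone
    distance, both rounded into bins, as a compact (assumed) state descriptor.
    """
    d_leader = np.linalg.norm(positions - leader_pos, axis=1).mean()
    diffs = positions[:, None, :] - positions[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)
    d_swarm = pairwise[np.triu_indices(len(positions), k=1)].mean()
    return (int(d_leader // bin_size), int(d_swarm // bin_size))
```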
Algorithm 1 Q-learning algorithm
Input: Environment (E); Action space (A); Initial state ($x_0$); Discount factor ($\gamma$); Learning rate ($\alpha$). Output: Policy ($\pi$).
- Environment (E): The setting in which the agent operates.
- Action space (A): All possible actions the agent can take.
- Initial state ($x_0$): The starting point of the agent in the environment.
- Discount factor ($\gamma$): The degree to which future rewards are diminished compared to immediate rewards.
- Learning rate ($\alpha$): How much new information overrides old information.
- Policy ($\pi$): A strategy that the agent follows, mapping states to the best action to perform in that state.
- Initialization: $Q(x, a) = 0$: The Q-value for all state–action pairs is initialized to zero. $\pi(x, a) = 1/|A(x)|$: The policy is initialized to be a uniform distribution, where each action is equally probable if $|A(x)|$ actions are possible in state x.
- Set initial state: $x = x_0$: The agent starts at the initial state $x_0$.
- Learning Loop: This loop continues indefinitely, iterating over each time step t.
- Action Selection: An action a is chosen using the $\epsilon$-greedy version of the current policy $\pi$. This selects the best action most of the time but occasionally takes a random action to explore the environment. The agent performs action a, receives a reward r, and transitions to a new state $x'$.
- Action Value Update: The Q-value of the current state–action pair is updated using the Bellman equation, $Q(x, a) \leftarrow Q(x, a) + \alpha \bigl( r + \gamma \max_{a'} Q(x', a') - Q(x, a) \bigr)$, incorporating the learning rate $\alpha$, the received reward r, the discount factor $\gamma$, and the maximum Q-value of the subsequent state $x'$.
- Policy Update: $\pi(x) = \arg\max_{a''} Q(x, a'')$: The policy is updated to choose the action with the highest Q-value for state x.
- State Transition: $x = x'$: Update the state to the new state $x'$. A compact Python sketch of this procedure is given below.
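The sketch below is a minimal tabular Q-learning loop following the steps of Algorithm 1. It assumes a generic `env` object exposing `reset()` and `step()` and a finite list of hashable actions (for instance, the (cohesion, repulsion) pairs from the earlier sketch); it is an illustrative reimplementation, not the authors' code.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy policy.

    env is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    states must be hashable (e.g., the discretized tuples from the earlier sketch).
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        x = env.reset()                      # set initial state x = x0
        done = False
        while not done:
            # epsilon-greedy action selection
            a = random.choice(actions) if random.random() < epsilon else greedy(x)
            x_next, r, done = env.step(a)    # act, observe reward and next state
            # Bellman update of the action value
            best_next = max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next                       # state transition
    # final greedy policy: map each visited state to its best action
    states = {s for (s, _) in list(Q)}
    policy = {s: greedy(s) for s in states}
    return Q, policy
```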
4. Simulation Scenarios and Results
- Obstacle avoidance reward: This component rewards agents for successfully navigating past obstacles. The reward is computed as the sum of the distances among agents when they successfully avoid obstacles. By encouraging agents to maintain a safe distance from obstacles, this component promotes effective obstacle avoidance behavior. Conversely, if agents fail to surmount obstacles, a penalty of −1 is incurred, discouraging collision or unsuccessful navigation attempts.
- Diffusion reward: This component focuses on maintaining a safe distance from walls while satisfying the expansion criteria. When agents keep a safe distance from walls and meet the expansion criteria, the reward is calculated as the cumulative inter-agent distance, which encourages agents to spread out evenly within the swarm rather than clustering near walls. If agents make contact with walls or fail to meet the expansion criteria, a penalty of −1 is imposed, discouraging wall collisions and failure to achieve the desired expansion. (A reward sketch follows this list.)
- Exploration in Early Stages: During the initial training phases, when the number of training sessions is small, we prioritize exploration by assigning a relatively high value to $\epsilon$. This exploration-centric approach allows the algorithm to thoroughly explore and understand the environment, facilitating the discovery of potentially optimal solutions.
- Exploitation in Later Stages: As training progresses and the rewards associated with different combinations of cohesion and repulsion parameters become well estimated, excessive exploration becomes unnecessary. Therefore, we gradually reduce the value of $\epsilon$ as the number of training sessions increases. To achieve this, we utilize the floor function, denoted as $\lfloor \cdot \rfloor$, which produces a gradual, stepwise reduction in $\epsilon$. Decreasing $\epsilon$ shifts the focus towards exploiting the learned knowledge, enabling more informed and optimal decision making. (A small decay sketch also follows this list.)
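As a hedged illustration of the reward design above, the sketch below sums an obstacle-avoidance term and a diffusion term, each falling back to the −1 penalty on failure. The boolean predicates (`cleared_obstacles`, `safe_from_walls`, `expanded`) stand in for the paper's actual obstacle, wall-distance, and expansion criteria, and the simple summation of the two terms is an assumption.

```python
import numpy as np

def pairwise_distance_sum(positions):
    """Sum of inter-agent distances for a set of agent positions."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists[np.triu_indices(len(positions), k=1)].sum()

def swarm_reward(positions, cleared_obstacles, safe_from_walls, expanded):
    """Obstacle-avoidance reward plus diffusion reward, each with a -1 penalty on failure."""
    spread = pairwise_distance_sum(positions)
    obstacle_reward = spread if cleared_obstacles else -1.0
    diffusion_reward = spread if (safe_from_walls and expanded) else -1.0
    return obstacle_reward + diffusion_reward
```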
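And a minimal example of a floor-based $\epsilon$ schedule in the spirit of the exploration–exploitation strategy above; the initial value, step size, episode bucket, and lower bound are assumptions.

```python
import math

def epsilon_schedule(episode, eps0=0.9, step=0.1, bucket=100, eps_min=0.05):
    """Reduce epsilon in discrete steps as training progresses.

    The floor of (episode / bucket) counts how many buckets of episodes have
    elapsed; each completed bucket lowers epsilon by `step`, down to eps_min.
    """
    eps = eps0 - step * math.floor(episode / bucket)
    return max(eps, eps_min)

# Example: epsilon stays at 0.9 for episodes 0-99, drops to 0.8 for 100-199, etc.
```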
5. Real Robot Experiments
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
81 | 97 | 113 | 124 | 133 | 265 | 173 | 256 | 867 | 647 | 345
159 | 130 | 106 | 88 | 84 | 217 | 1014 | −1 | −1 | −1 | −1
32 | 58 | 65 | 72 | 102 | 124 | 156 | 221 | 268 | 391 | 215
85 | 100 | 127 | 164 | 245 | 630 | −1 | −1 | −1 | −1 | −1
−1 | −1 | −1 | −1 | −1 | −1 | −1 | −1 | 79 | 397 | −1
References
- Han, C.; Yin, J.; Ye, L.; Yang, Y. NCAnt: A network coding-based multipath data transmission scheme for multi-UAV formation flying networks. IEEE Commun. Lett. 2020, 25, 1041–1044.
- Li, S.; Fang, X. A modified adaptive formation of UAV swarm by pigeon flock behavior within local visual field. Aerosp. Sci. Technol. 2021, 114, 106736.
- Zhang, B.; Sun, X.; Liu, S.; Deng, X. Adaptive differential evolution-based distributed model predictive control for multi-UAV formation flight. Int. J. Aeronaut. Space Sci. 2020, 21, 538–548.
- Nakagawa, E.Y.; Antonino, P.O.; Schnicke, F.; Capilla, R.; Kuhn, T.; Liggesmeyer, P. Industry 4.0 reference architectures: State of the art and future trends. Comput. Ind. Eng. 2021, 156, 107241.
- Hu, B.; Sun, Z.; Hong, H.; Liu, J. UAV-aided networks with optimization allocation via artificial bee colony with intellective search. EURASIP J. Wirel. Commun. Netw. 2020, 2020, 40.
- Kim, J.; Oh, H.; Yu, B.; Kim, S. Optimal task assignment for UAV swarm operations in hostile environments. Int. J. Aeronaut. Space Sci. 2021, 22, 456–467.
- Pham, Q.V.; Huynh-The, T.; Alazab, M.; Zhao, J.; Hwang, W.J. Sum-rate maximization for UAV-assisted visible light communications using NOMA: Swarm intelligence meets machine learning. IEEE Internet Things J. 2020, 7, 10375–10387.
- Reynolds, C.W. Flocks, herds and schools: A distributed behavioral model. ACM SIGGRAPH Comput. Graph. 1987, 21, 25–34.
- Vásárhelyi, G.; Virágh, C.; Somorjai, G.; Nepusz, T.; Eiben, A.E.; Vicsek, T. Optimized flocking of autonomous drones in confined environments. Sci. Robot. 2018, 3, eaat3536.
- Soria, E.; Schiano, F.; Floreano, D. Predictive control of aerial swarms in cluttered environments. Nat. Mach. Intell. 2021, 3, 545–554.
- Wang, M.; Zeng, B.; Wang, Q. Research on motion planning based on flocking control and reinforcement learning for multi-robot systems. Machines 2021, 9, 77.
- Bai, C.; Yan, P.; Pan, W.; Guo, J. Learning-based multi-robot formation control with obstacle avoidance. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11811–11822.
- Long, P.; Fan, T.; Liao, X.; Liu, W.; Zhang, H.; Pan, J. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 6252–6259.
- Yan, Y.; Li, X.; Qiu, X.; Qiu, J.; Wang, J.; Wang, Y.; Shen, Y. Relative distributed formation and obstacle avoidance with multi-agent reinforcement learning. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 1661–1667.
- Sui, Z.; Pu, Z.; Yi, J.; Wu, S. Formation control with collision avoidance through deep reinforcement learning using model-guided demonstration. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2358–2372.
- Buşoniu, L.; Babuška, R.; De Schutter, B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1; Springer: Berlin/Heidelberg, Germany, 2010; pp. 183–221.
- Nguyen, T.T.; Nguyen, N.D.; Nahavandi, S. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Trans. Cybern. 2020, 50, 3826–3839.
- Canese, L.; Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-agent reinforcement learning: A review of challenges and applications. Appl. Sci. 2021, 11, 4948.
- Leonard, N.E.; Fiorelli, E. Virtual leaders, artificial potentials and coordinated control of groups. In Proceedings of the 40th IEEE Conference on Decision and Control (Cat. No. 01CH37228), Orlando, FL, USA, 4–7 December 2001; Volume 3, pp. 2968–2973.
- Droge, G. Distributed virtual leader moving formation control using behavior-based MPC. In Proceedings of the 2015 American Control Conference (ACC), Chicago, IL, USA, 1–3 July 2015; pp. 2323–2328.
- Saska, M.; Baca, T.; Hert, D. Formations of unmanned micro aerial vehicles led by migrating virtual leader. In Proceedings of the 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), Phuket, Thailand, 13–15 November 2016; pp. 1–6.
- Olfati-Saber, R. Flocking for multi-agent dynamic systems: Algorithms and theory. IEEE Trans. Autom. Control 2006, 51, 401–420.
- Rooban, S.; Javaraiu, M.; Sagar, P.P. A detailed review of swarm robotics and its significance. In Proceedings of the 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 7–9 April 2022; pp. 797–802.
- Kyzyrkanov, A.; Atanov, S.; Aljawarneh, S.; Tursynova, N.; Kassymkhanov, S. Algorithm of Coordination of Swarm of Autonomous Robots. In Proceedings of the 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 4–6 May 2023; pp. 539–544.
- Holland, J.H. Genetic algorithms. Sci. Am. 1992, 267, 66–73.
- Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39.
- Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN'95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948.
- Ballerini, M.; Cabibbo, N.; Candelier, R.; Cavagna, A.; Cisbani, E.; Giardina, I.; Lecomte, V.; Orlandi, A.; Parisi, G.; Procaccini, A.; et al. Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study. Proc. Natl. Acad. Sci. USA 2008, 105, 1232–1237.
- Din, A.; Jabeen, M.; Zia, K.; Khalid, A.; Saini, D.K. Behavior-based swarm robotic search and rescue using fuzzy controller. Comput. Electr. Eng. 2018, 70, 53–65.
- Greenwald, A.; Hall, K.; Serrano, R. Correlated Q-learning. In Proceedings of the ICML, Washington, DC, USA, 21–24 August 2003; Volume 3, pp. 242–249.
- Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Guo, J.; Huo, Y.; Shi, X.; Wu, J.; Yu, P.; Feng, L.; Li, W. 3D aerial vehicle base station (UAV-BS) position planning based on deep Q-learning for capacity enhancement of users with different QoS requirements. In Proceedings of the 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 1508–1512.
- Arani, A.H.; Hu, P.; Zhu, Y. HAPS-UAV-Enabled Heterogeneous Networks: A Deep Reinforcement Learning Approach. arXiv 2023, arXiv:2303.12883.
Symbol | Definitions
---|---
i, j | The indexes of drones
$p_i$ | The position of drone i
$v_i$ | The velocity of drone i
$u_i$ | The control input of drone i
$d_{ij}$ | The distance between drones i and j
$G$ | The undirected perceptual graph
$V$ | The set of vertices
$E$ | The set of edges
$N_i$ | The neighboring drones of drone i
k | The proportional coefficient
 | The communication distance among the drones
 | The boundary distance between repulsion and cohesion
 | The cohesive parameter
 | The repulsive parameter
$w_{ij}$ | The weight associated with the communication link between drone i and drone j
$d_{\text{lead}}$ | The desired direction towards the virtual leader
$d_{\text{rep}}$ | The repulsion from a neighboring drone
s | The state
a | The action
r | The reward
S | The state space
$\gamma$ | The discount factor
$\pi(a \mid s)$ | The probability of taking action a in state s under a probabilistic policy
$P(s' \mid s, a)$ | The probability of taking action a from state s and transitioning to state $s'$