PPO-Exp: Keeping Fixed-Wing UAV Formation with Deep Reinforcement Learning
Round 1
Reviewer 1 Report
This paper proposes a low-communication-cost protocol and a variant of Proximal Policy Optimization for the fixed-wing UAV formation problem, and the method is verified in a flocking scenario consisting of one leader and several followers. The logic of the paper is relatively clear, and the proposed method is reasonable and innovative. The following questions are for the authors' consideration.
1. Please check and correct spelling and sentence errors in this paper.
2. I can't find $\omega_i$, $\eta_{\omega_i}$, $\mu_{\omega_i}$ and $\sigma_{\omega_i}^{2}$ in Equation (1). Please correct the description.
3. Please correct the incomplete description of Equation (9).
4. Please briefly introduce the PPO-Clip, PPO-KL, TD3, and DDPG, and indicate the references separately in section 4.
5. In section 4.2, I cannot reach the same conclusion as you, that “in all variations of PPO, the PPO-Exp has the best performance”, from Fig. 6(a). Please explain in detail.
6. In section 4.2, the conclusion that “the 0.05 is the balance point between exploration and exploitation found by PPO-Exp” confuses me. Please explain which is better: the series ε(t) or ε=0.05.
7. Can the method proposed in this paper be applied to a larger UAV formation? Please explain in detail.
Author Response
Point 1: Please check and correct spelling and sentence errors in this paper.
Response 1: We checked the paper and revised all the spelling and sentence errors we found.
Point 2: I can't find $\omega_i$, $\eta_{\omega_i}$, $\mu_{\omega_i}$ and $\sigma_{\omega_i}^{2}$ in Equation (1). Please correct the description.
Response 2: The symbols $\omega_i$, $\eta_{\omega_i}$, $\mu_{\omega_i}$ and $\sigma_{\omega_i}^{2}$ are redundant, and we deleted them in the revised version.
Point 3: Please correct the incomplete description of Equation (9).
Response 3: The symbols $\omega_i$, $\eta_{\omega_i}$, $\mu_{\omega_i}$ and $\sigma_{\omega_i}^{2}$ are redundant, and we deleted them in the revised version.
Point 4: Please briefly introduce the PPO-Clip, PPO-KL, TD3, and DDPG, and indicate the references separately in section 4.
Response 4: We introduced PPO-Clip, PPO-KL, TD3, and DDPG in the experiment section of the revised version.
Point 5: In section 4.2, I cannot reach the same conclusion as you, that “in all variations of PPO, the PPO-Exp has the best performance”, from Fig. 6(a). Please explain in detail.
Response 5: The curve of PPO-Exp is clearly higher than all the other curves except that of PPO-KL, and it is higher than the PPO-KL curve most of the time, so we consider that PPO-Exp achieves the best performance.
Point 6: In section 4.2, the conclusion that “the 0.05 is the balance point between exploration and exploitation found by PPO-Exp” confuses me. Please explain which is better: the series ε(t) or ε=0.05.
Response 6: This is an important question. We ran the experiment with ε=0.05, and the results are shown in the ablation experiment part.
Point 7: Can the method proposed in this paper be applied to a larger UAV formation? Please explain in detail.
Response 7: We added a new section, “Related Work”, after section I. In the new section, we discussed the formation scale in centralized approaches and pointed out that these approaches struggle with large-scale formations because of the dimensionality problem. We think the proposed method cannot directly handle the large-scale formation problem, but it is the basis of the Hierarchical PPO, which is proposed in another work of ours that is in preparation. In that work, we study the multi-task flocking problem of large-scale fixed-wing UAV swarms and use multi-exploration advantages as the hierarchical target to guide the policy update across multiple tasks.
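The distinction raised in Point 6 can be illustrated with a minimal sketch. It assumes that ε denotes the clip parameter of the standard PPO-Clip surrogate objective; the linear decay in `eps_schedule` and its endpoints (0.2 to 0.05) are hypothetical and are not the paper's ε(t), which is not specified in this exchange.

```python
def clipped_surrogate(ratios, advantages, eps):
    """Standard PPO-Clip objective: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * a, r_clipped * a))
    return sum(terms) / len(terms)


def eps_schedule(t, total_steps, eps_start=0.2, eps_end=0.05):
    """Hypothetical linear decay of eps from eps_start to eps_end (an assumption)."""
    frac = min(max(t / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


# Fixed clip range versus a time-varying one at training step t.
ratios, advantages = [1.5, 0.5], [1.0, -1.0]
fixed = clipped_surrogate(ratios, advantages, eps=0.05)
scheduled = clipped_surrogate(ratios, advantages, eps=eps_schedule(0, 1000))
```

With a fixed ε=0.05 the policy ratio is clipped tightly from the first update, whereas a schedule allows larger updates early in training and converges to the same clip range later.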
Author Response File: Author Response.pdf
Reviewer 2 Report
Good and original work in general terms. The authors presented an RL-based UAV formation coordination algorithm and compared their results with other state-of-the-art reinforcement learning methods. An ablation experiment was carried out, and performance metrics were computed and compared with previous methods.
The research style is appropriate and the results are clearly displayed. However, it would be interesting to test the formation-keeping system in other scenarios, for instance in turns or in an obstacle avoidance maneuver.
Besides, there are some grammar and writing errors that should be corrected, for instance in lines 65 and 104, among others. Also, the authors do not leave a space before opening parentheses. These are minor issues that should be fixed.
Author Response
Point 1: However, it would be interesting to test the formation-keeping system in other scenarios, for instance in turns or in an obstacle avoidance maneuver.
Response 1: We added a new section, “Related Work”, after section I. In the new section, we discussed the feasibility of multi-task learning in centralized approaches and pointed out that these approaches struggle with multi-task scenarios because of the dimensionality problem. We think the proposed method cannot directly handle multiple tasks, but it is the basis of the Hierarchical PPO, which is proposed in another work of ours that is in preparation. In that work, we study the multi-task flocking problem of large-scale fixed-wing UAV swarms and use multi-exploration advantages as the hierarchical target to guide the policy update across multiple tasks.
Point 2: There are some grammar and writing errors that should be corrected, for instance in lines 65 and 104, among others. Also, the authors do not leave a space before opening parentheses.
Response 2: We checked the paper and revised all of the spelling and sentence errors that we found, and we deleted the space before the parentheses.
Author Response File: Author Response.pdf
Reviewer 3 Report
The paper idea is good; however, there are many things that need to be addressed first.
What are the main differences (pros and cons) between centralised (as proposed) and decentralised approaches for UAV formation and control? The authors are encouraged to add a related work section highlighting the literature and these aspects.
The paper states “The goal of the task is to reach the target area( See the green circle area in Fig. 1) with the formation as orderly as possible when the leader enters the target area; the mission is complete.” This goal is very simple. How about other types of missions, such as surveillance, reconnaissance, photography, or tracking mobile targets? How would the UAV formation behave? The authors need to show the significance of the proposed approach in the context of different missions and objectives.
What is the case when the leader (with the intelligence chip) drops out? How would the remaining flock act in this case?
In Figure 2, what is the unit/label for the y-axis colour gradient from 0 to 2?
The authors showed only one type of formation. How about other formations, such as the leader and all followers on the same vertical or horizontal line? How would the control strategy work for any type of defined formation?
The authors need to compare the proposed decentralised approach with other existing approaches in the literature in light of what they present in the related work section.
I would say that this paper is more on the communication optimisation between multi-UAVs rather than on the control of multi UAV formation.
Please provide more attention to the language style.
Author Response
Point 1: What are the main differences (pros and cons) between centralised (as proposed) and decentralised approaches for UAV formation and control? The authors are encouraged to add a related work section highlighting the literature and these aspects. The authors showed only one type of formation. How about other formations, such as the leader and all followers on the same vertical or horizontal line? How would the control strategy work for any type of defined formation?
Response 1: We added a new section, “Related Work”, after section I. In the new section, we discussed the centralized and decentralized approaches. Unlike our approach, the current centralized approaches do not consider communication within the formation.
Point 2: The paper states “The goal of the task is to reach the target area( See the green circle area in Fig. 1) with the formation as orderly as possible when the leader enters the target area; the mission is complete.” This goal is very simple. How about other types of missions, such as surveillance, reconnaissance, photography, or tracking mobile targets? How would the UAV formation behave? The authors need to show the significance of the proposed approach in the context of different missions and objectives.
Response 2: The proposed approach can handle a single task, such as tracking a mobile target, by replacing the fixed target coordinate with a random-walk model during training. This differs little from the target-reaching task but requires more training time, because the target must follow a random walk to improve the generalization ability of the agent. If the reviewer has a stronger reason, we would be glad to add such experiments.
Point 3: The authors showed only one type of formation. How about other formations, such as the leader and all followers on the same vertical or horizontal line? How would the control strategy work for any type of defined formation?
Response 3: With a reward function designed for the vertical or horizontal line formation, our approach can keep those formations as well. However, one of the contributions of this paper is to set up a reward function and design the communication protocol for a specific formation, so also showing the vertical and horizontal line formations would make this contribution less obvious. If the reviewer has a stronger reason, we would be glad to add them.
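The random-walk target mentioned in this response could be generated along the following lines. This is only a sketch under stated assumptions: the function name `random_walk_target`, the uniform step distribution, and the `step_size` parameter are hypothetical, since the response does not describe the model the authors would use.

```python
import random


def random_walk_target(start, steps, step_size=1.0, seed=None):
    """Return a 2-D random-walk trajectory of target coordinates.

    Each step moves the target by an independent uniform offset in
    [-step_size, step_size] along x and y (an assumed step model).
    """
    rng = random.Random(seed)
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        x += rng.uniform(-step_size, step_size)
        y += rng.uniform(-step_size, step_size)
        path.append((x, y))
    return path


# One training episode would read the target position from path[t]
# instead of using a fixed coordinate.
path = random_walk_target(start=(100.0, 100.0), steps=50, step_size=2.0, seed=0)
```

Sampling a fresh trajectory per episode, rather than reusing one, is what would expose the agent to varied target motion during training.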
Point 4: What is the case when the leader (with the intelligence chip) drops out? How would the remaining flock act in this case?
Response 4: We discussed this situation in section II. It is a general problem in leader-follower UAV systems. Reference [37] pointed out that a defect in or jamming of the leader will cause the whole system to fail. In our system, we handle this case by setting the followers to return when they lose communication with the leader. If the leader is shot down, the other drones will be able to sense it through the communication protocol, and they will return immediately. This is a fundamental problem that has existed for a long time, and we think it requires further study.
Point 5: In Figure 2, what is the unit/label for the y-axis colour gradient from 0 to 2?
Response 5: The label of Fig. 2 refers to the communication priority that we describe in paragraph 270-274. We have added the label “Priority” to Fig. 2.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
I'm not totally satisfied with the provided responses to my earlier comments. I feel the authors have overlooked many details that can be provided to enhance the quality of this paper.
Author Response
Point 1: The paper states “The goal of the task is to reach the target area( See the green circle area in Fig. 1) with the formation as orderly as possible when the leader enters the target area; the mission is complete.” This goal is very simple. How about other types of missions, such as surveillance, reconnaissance, photography, or tracking mobile targets? How would the UAV formation behave? The authors need to show the significance of the proposed approach in the context of different missions and objectives.
Response 1: We reconsidered this advice. In keeping with the theme of the paper, we added experiments on a formation-keeping task with obstacles and on a formation-changing task, both of which are more complex than the original experiments. The experiments, and the significance of the proposed approach in the context of different missions and objectives, are presented in the last part of section VI.
Point 2: The authors showed only one type of formation. How about other formations, such as the leader and all followers on the same vertical or horizontal line? How would the control strategy work for any type of defined formation?
Response 2: In the last part of section VI, we considered the formation-changing task, which includes the vertical line formation. The experiments show that PPO-Exp performs better than PPO-Clip. The control strategy for this task is described as well.
Point 3: Please provide more attention to the language style.
Response 3: Considering that the reviewers recommended English language editing, we purchased MDPI's English editing service, and the language style has been improved.
Author Response File: Author Response.docx