COLREGs-Based Path Planning for USVs Using the Deep Reinforcement Learning Strategy
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents COLREGs-Based Path Planning for USVs Using the Deep Reinforcement Learning Strategy. While the paper is interesting, it needs a lot of improvement:
1. The abstract must be improved by presenting what is done, the results and some benefits of the study.
2. Literature review requires a lot of work; the authors must follow the standard way of reviewing the paper as well as presenting the novelty of the work based on that.
3. Literature [6] must be mentioned according to the journal format.
4. Novelty must be clear based on the other papers mentioned before.
5. Why MAPPO Algorithm is used? Please clarify in the manuscript.
6. What kind of software do you use to perform the simulation?
7. The conclusion must be improved and it is better to mention it as a point after some brief description of the work.
8. What is the lab that the experiment has been performed? Please mention and describe.
9. Please follow the format of journal in writing the refs.
10. How did you collect the data, and what kind of sensors did you use?
11. How did you ensure the accuracy of the results?
Author Response
Thank you very much for taking the time to review this manuscript. Please find the figures and table in the response file below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.
The paper presents COLREGs-Based Path Planning for USVs Using the Deep Reinforcement Learning Strategy. While the paper is interesting, it needs a lot of improvement:
Response to Reviewer 1
Thank you very much for the careful reading of our manuscript and valuable suggestions. Based on the comments, we have made modifications on the revised manuscript. Detailed revisions were shown as follows.
The changes to our manuscript within the document were also highlighted by using red text.
Comment 1.
The abstract must be improved by presenting what is done, the results and some benefits of the study.
Response. Thank you for your valuable suggestion. Based on your comment, we have revised the whole abstract of our manuscript to provide a clear overview of our study, the results, and the benefits.
The rewritten abstract is presented below.
“Abstract: This research introduces a two-stage deep reinforcement learning approach for the cooperative path planning of Unmanned Surface Vehicles (USVs). The method is designed to address cooperative collision avoidance path planning while adhering to the International Regulations for Preventing Collisions at Sea (COLREGs) and considering the collision avoidance problem within the USV fleet and between USVs and Target Ships (TSs). To achieve this, the study presents a dual COLREGs-compliant action selection strategy to effectively manage the vessel avoidance problem. Firstly, we construct a COLREGs-compliant action evaluation network that utilizes a deep learning network trained on pre-recorded TS avoidance trajectories by USVs in compliance with COLREGs. Then, the COLREGs-compliant reward function-based action selection network is proposed by considering various TS encountering scenarios. Consequently, the results of the two networks are fused to select actions for cooperative path planning processes. The path planning model is established using the Multi-Agent Proximal Policy Optimization (MAPPO) method. The action space, observation space, and reward function are tailored for the policy network. Additionally, a TS detection method is introduced to detect the motion intentions of TSs. The study conducted Monte Carlo simulations to demonstrate the strong performance of the planning method. Furthermore, experiments focusing on COLREGs-based TS avoidance were carried out to validate the feasibility of the approach. The proposed TS detection model exhibited robust performance within the defined task”.
Comment 2.
Literature review requires a lot of work; the authors must follow the standard way of reviewing the paper as well as presenting the novelty of the work based on that.
Response. Thank you very much for your constructive comment; it is important for enhancing the literature review and for presenting the novelty of the work more effectively.
According to the comment, we have carefully revised the literature review in the standard way and have clearly presented the novelty of the work based on it.
- We have referenced the latest contributions in the field of COLREGs-based path planning to enhance our literature review.
The newly added references are listed below.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, PMLR, 2018, 1587-1596.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
- We have added the reviews of the advancements in interpreting and implementing the COLREGs rules, as follows.
“
The development of International Regulations for Preventing Collisions at Sea-compliant (COLREGs-compliant) navigation has proceeded along two dimensions: the complexity of ship encounter scenarios and the evolution of methodologies.
The studies [7], [8], and [9] thoroughly analyzed and interpreted various scenarios in which Unmanned Surface Vehicles (USVs) may encounter other vessels during their navigation, as well as the COLREGs rules that all vessels must follow to avoid collision. The research [7] provides valuable insights into ship collision avoidance based on COLREG 72, which can be useful for Officers Of the Navigational Watch (OONW) both onboard and remotely, as well as for autonomous systems. The research is supported by examples drawn from various works that, despite their significant influence in the current literature, may not have a correct standpoint. The study [8] contributes to providing an understanding of COLREGs sailing rules based on the insights of navigators and researchers. In the study [9], the recent progress in COLREGs-compliant navigation of USVs, from traditional to learning-based approaches, is reviewed in depth.
There are mainly four classes of conventional methods that strive to accommodate the COLREGs rules in path planning modules [9].
Rule-based methods rely on hand-crafted design and focus on simple ship-encounter scenarios, and hence are hard to extend to more complex ship-encounter scenarios. Hybrid methods, such as A* variants and Rapidly-exploring Random Tree variants, suffer from the complexity of the multi-TS avoidance problem and the high dimensionality of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also hard to apply in complex TS encounter scenarios.
Additionally, traditional studies usually focused on simple one-on-one ship encounter scenarios dictated by Rules 13–16 in Part B of COLREGs. However, USV path planning becomes more challenging when multi-TS encounter scenarios are considered. When these algorithms are deployed in simulated or real scenarios, their simplified and basic assumptions about COLREGs are far from its real complexity [7, 8].
Therefore, traditional methods are not able to fully use the seamanship of experienced mariners in solving complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions
”.
- We have incorporated the following sentences to offer the reader a vivid comprehension and impression of the deep reinforcement learning method.
“
Different from traditional methods, Deep Reinforcement Learning (DRL), as a Machine Learning (ML) paradigm for achieving multi-agent collaboration, mainly studies the synchronous learning and evolution of the agent strategies used to plan the coordinated movement of a formation in real time [10]. By interacting with the environment, it continuously optimizes the agents' action strategies. The value functions of different action strategies in the current state are estimated, and high-return actions are executed so as to avoid performing low-return or punitive actions. The deep network module of DRL is utilized to fit the motion model of the USVs, enabling smooth control and avoiding local optimal solutions [10]. DRL is well suited to situations where the optimal decision-making strategy is not known beforehand and to non-stationary environments where the underlying dynamics may change over time [10]. These characteristics make it an effective and powerful tool for solving our task.
DRL methods can be divided into three categories [11]. Value-based methods, such as DQN variants, estimate the optimal values of all different states and then derive the optimal policy using the estimated values [12, 13]. Policy-based methods, such as Trust Region Policy Optimization (TRPO) [14] and Proximal Policy Optimization (PPO) [15], optimize the policy directly without maintaining the value functions. Actor–critic methods, such as Deep Deterministic Policy Gradient (DDPG) [16], Twin Delayed DDPG (TD3) [17] and Soft Actor-Critic (SAC) [18], can be viewed as a combination of the above two methods, and they maintain an explicit representation of both the policy (the actor) and the value estimates (the critic)
”.
- We have revised the description of the fundamental DRL methods.
“
TRPO, a trust-region policy gradient method, ensures that each update to the policy parameters stays within a trust region to prevent large parameter updates that might degrade the policy performance, and it can handle high-dimensional state spaces [14]. However, it is relatively complicated, is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks), and has poor data efficiency. To address this problem, the Proximal Policy Optimization (PPO) method emerged. PPO converts the constraint term of TRPO into a penalty term to decrease the complexity of the constrained optimization problem, and it uses only first-order optimization [15]. Multi-Agent PPO (MAPPO) is a version of PPO for multi-agent partially observable Markov decision processes [19]. The Multi-Agent DDPG (MADDPG) method is another state-of-the-art method besides MAPPO. In MADDPG, each agent treats the other agents as part of the environment, and agents in the same region cooperate with each other to determine the optimal coordinated action [16]. However, it is hard to achieve stability with MADDPG, due to the complexity of its hyperparameters
”.
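For the reviewer's convenience, the two standard PPO surrogate objectives referred to in the revised text above (the KL-penalty form corresponding to "converting the constraint term into a penalty term", and the widely used clipped form from [15]) can be written compactly as follows; this is background material only and not an equation added to the manuscript.

```latex
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]
\[
L^{\mathrm{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\,\hat{A}_t
  \;-\; \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right] \right]
\]
\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\bigl(r_t(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,\hat{A}_t \right) \right]
\]
```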
- We have added some COLREGs-based collision avoidance methods via the deep reinforcement learning model to improve the literature review.
“
In the paper [20], a multi-USV automatic collision avoidance method was employed based on a Double Deep Q Network (DDQN) with prioritized experience replay. In the study [21], the PPO algorithm and a hand-crafted reward function are used to encourage the USV to comply with the COLREGs rules.
Sawada et al. [22] proposed a collision avoidance method based on PPO; it uses a grid sensor to quantize obstacle zones by target and uses a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to control the rudder angle. Xu et al. [23] proposed a COLREGs Intelligent Collision Avoidance (CICA) algorithm, which tracks the current network weights to update the target network weights, improving the stability of learning the optimal strategy.
The study [24] employed a collision avoidance framework that divides all encounter scenarios into seven types according to the avoidance constraints of the COLREGs for different encountered scenes.
In the study [25], the COLREGs and ship maneuverability were considered in the reward for achieving multi-ship automatic collision avoidance, and the Optimal Reciprocal Collision Avoidance (ORCA) algorithm was used to detect and reduce the risk of collision
”.
- We have included a table to demonstrate the advantages and limitations of classic path re-planning methods and DRL methods.
“
Tbl.1. Introduction to comparative methods

| Method | Advantage | Limitation |
| --- | --- | --- |
| Path re-planning methods | | |
| Rule-based methods | COLREGs rules integrated into path re-planning | relying on hand-crafted design; hard to extend to complex ship encounter scenarios |
| Hybrid methods: A* | fast; COLREGs compliance incorporated into path re-planning | hard to extend to more complex ship encounter scenarios |
| Reactive methods: velocity obstacle | fast; COLREGs compliance enforced by integrating forbidden zones | accurate TS course information required |
| Optimization-based methods | optimal; COLREGs rules naturally formulated as constraints | relatively high computation burden |
| DRL methods | | |
| Value-based methods: Q-learning, DQN | optimal policy derived from estimates of the optimal values of all different states | overestimation bias; high dimensionality; hard to strike a balance between exploration and exploitation |
| Policy-based methods: MAPPO, TRPO | directly optimizing the policy without maintaining the value functions | careful design of reward functions required |
| Actor–critic methods: MADDPG, MATD3 | explicit representation of both the policy and the value estimates | computationally expensive; hyperparameter sensitivity |
”
- To provide the motivations for our study, we have analyzed the existing unresolved difficult issues or problems in the field of COLREGs-based path planning as follows.
“
The majority of current DRL methods focus on generating COLREGs-compliant paths using a COLREGs-based reward function. However, the high randomness in early-stage action selection can lead to unpredictable policy gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring multiple USVs can maintain formation and reach their respective goals [30, 31]. Additionally, being COLREGs-compliant does not guarantee an ideal evasive behavior as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectory data and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure USV navigation adheres to COLREGs rules but also replicate the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments poses a complex problem due to the need to plan multiple paths simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner is efficient. MAPPO offers high efficiency in learning, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task
”.
- Subsequently, we explicitly outlined the novelties of our research as follows:
“
Subsequently, this research seeks to propose a two-stage Multi-Agent Reinforcement Learning (MARL) scheme based on the MAPPO algorithm, incorporating a centralized training and decentralized execution strategy. The following innovations are presented:
- We introduce a COLREGs-compliant action evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and the candidate action vector from the actor network, we select the action that is most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
- We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
- A TS motion detection network is constructed to provide guidance for the decision-making process of the MARL model.
”.
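To make the fusion step described in the first innovation more concrete, a minimal sketch is given below. It assumes that both the actor network and the COLREGs-compliant action evaluation network output scores over the same set of seven discrete candidate actions; all names (fuse_and_select, compliance_probs, etc.) are illustrative and do not reproduce the exact implementation in the manuscript.

```python
import torch
import torch.nn.functional as F

def fuse_and_select(actor_logits: torch.Tensor,
                    compliance_probs: torch.Tensor,
                    deterministic: bool = False) -> torch.Tensor:
    """Fuse the actor's candidate-action distribution with the COLREGs-compliance
    probabilities and pick one action per agent.

    actor_logits:     (n_agents, n_actions) raw scores from the actor network
    compliance_probs: (n_agents, n_actions) COLREGs-compliance probability of each
                      candidate action from the evaluation network
    """
    actor_probs = F.softmax(actor_logits, dim=-1)
    # Element-wise fusion: actions judged to violate COLREGs (low compliance
    # probability) are suppressed before re-normalisation.
    fused = actor_probs * compliance_probs
    fused = fused / fused.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    if deterministic:
        return fused.argmax(dim=-1)
    return torch.distributions.Categorical(probs=fused).sample()

# Example: 3 USVs, 7 candidate actions (the action-space size reported in Section 5.3).
actions = fuse_and_select(torch.randn(3, 7), torch.rand(3, 7))
```

Multiplying the two distributions and re-normalizing suppresses candidate actions that the evaluation network judges to violate COLREGs before the final action is sampled.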
Comment 3.
Literature [6] must mentioned according to the journal format.
Response. Thank you for bringing the formatting issue to our attention. We apologize for any inconvenience caused. Upon receiving your comment, we have thoroughly reviewed and corrected the format of all the references, including reference [6], to ensure they align with the appropriate journal format. We appreciate your understanding and attention to detail in helping us maintain the accuracy and professionalism of our work.
Comment 4.
Novelty must be clear based on the other papers mentioned before.
Response. Thank you for your helpful suggestion. Accordingly, we clearly provided the novelties of our research based on the other papers mentioned before.
We have incorporated the latest contributions in the field of COLREGs-based path planning to enhance our literature review. Additionally, we have included reviews of advancements in interpreting and implementing the COLREGs rules. Then we have offered the reader a vivid comprehension and impression of the deep reinforcement learning method. We have also revised the description of the fundamental DRL methods. We have also included some COLREGs-based collision avoidance methods via the deep reinforcement learning model. To provide motivation for our study, we have analyzed the existing unresolved difficult issues or problems in the field of COLREGs-based path planning. Subsequently, we have clearly outlined the novelties of our proposed method.
The revised section of the novelties is as follows.
“
Subsequently, this research seeks to propose a two-stage Multi-Agent Reinforcement Learning (MARL) scheme based on the MAPPO algorithm, incorporating a centralized training and decentralized execution strategy. The following innovations are presented:
- We introduce a COLREGs-compliant action evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and the candidate action vector from the actor network, we select the action that is most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
- We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
- A TS motion detection network is constructed to provide guidance for the decision-making process of the MARL model
”.
Comment 5.
Why MAPPO Algorithm is used? Please clarify in the manuscript.
Response. Thank you for your question; it is highly valuable in enhancing the readability of our paper and justifying our choice of the basic planner.
MAPPO's ability to handle multi-agent interactions and effectively learn policies makes it well-suited to address the complexities of path planning in this context. By leveraging MAPPO's capabilities, we can ensure that multiple paths are planned concurrently, taking into account the dynamic nature of the environment and the need for collision avoidance among the USVs and other entities.
The high efficiency of MAPPO in learning allows for rapid adaptation to changes in the environment, enabling the planner to make informed decisions based on real-time conditions. This is particularly valuable in dynamic environments where circumstances can change quickly.
In addition, the inclusion of trust region updates in MAPPO enhances its stability during the training process. By preventing policy oscillations, MAPPO ensures that the learned policies are consistent and reliable. This stability is essential in maintaining safe and efficient path planning, especially when dealing with multiple interacting agents.
Subsequently, we have chosen MAPPO as the basic path planner for our USVs due to its high learning efficiency, fast convergence, and improved stability. By utilizing these strengths, we can effectively address the complexities of path planning in dynamic environments, ensuring optimized trajectories while prioritizing collision avoidance and coordination among the USVs and other entities present in the environment.
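As a further illustration of how MAPPO's training is typically organized (a centralized critic used during training and decentralized actors used at execution time), the following PyTorch sketch shows one possible shape of the actor and critic networks. The 24-dimensional local observation and 7 discrete actions follow the numbers reported in Section 5.3; the layer sizes and class names are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 24, 7   # values reported in Section 5.3

class Actor(nn.Module):
    """Decentralized actor: each USV maps its own observation to action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_ACTIONS))

    def forward(self, local_obs):          # (batch, OBS_DIM)
        return self.net(local_obs)         # (batch, N_ACTIONS) logits

class CentralCritic(nn.Module):
    """Centralized critic: sees the concatenated observations of all USVs during
    training only; it is not needed at execution time."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_AGENTS * OBS_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, joint_obs):          # (batch, N_AGENTS * OBS_DIM)
        return self.net(joint_obs)         # (batch, 1) state-value estimate
```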
- According to your comment, we have clarified the reason for choosing the MAPPO algorithm as our basic planner in the Introduction section as follows.
“
Tbl.1. Introduction to comparative methods

| Method | Advantage | Limitation |
| --- | --- | --- |
| Path re-planning methods | | |
| Rule-based methods | COLREGs rules integrated into path re-planning | relying on hand-crafted design; hard to extend to complex ship encounter scenarios |
| Hybrid methods: A* | fast; COLREGs compliance incorporated into path re-planning | hard to extend to more complex ship encounter scenarios |
| Reactive methods: velocity obstacle | fast; COLREGs compliance enforced by integrating forbidden zones | accurate TS course information required |
| Optimization-based methods | optimal; COLREGs rules naturally formulated as constraints | relatively high computation burden |
| DRL methods | | |
| Value-based methods: Q-learning, DQN | optimal policy derived from estimates of the optimal values of all different states | overestimation bias; high dimensionality; hard to strike a balance between exploration and exploitation |
| Policy-based methods: MAPPO, TRPO | directly optimizing the policy without maintaining the value functions | careful design of reward functions required |
| Actor–critic methods: MADDPG, MATD3 | explicit representation of both the policy and the value estimates | computationally expensive; hyperparameter sensitivity |
The comparative methods are outlined in Tbl.1. The majority of current DRL methods focus on generating COLREGs-compliant paths using a COLREGs-based reward function. However, the high randomness in early-stage action selection can lead to unpredictable policy gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring multiple USVs can maintain formation and reach their respective goals [30, 31]. Additionally, being COLREGs-compliant does not guarantee an ideal evasive behavior as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectory data and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure USV navigation adheres to COLREGs rules but also replicate the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments poses a complex problem due to the need to plan multiple paths simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner is efficient. MAPPO offers high efficiency in learning, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task.
Subsequently, this research seeks to propose a two-stage Multi-Agent Reinforcement Learning (MARL) scheme based on the MAPPO algorithm, incorporating a centralized training and decentralized execution strategy. The following innovations are presented:
- We introduce a COLREGs-compliant action evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and the candidate action vector from the actor network, we select the action that is most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
- We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
- A TS motion detection network is constructed to provide guidance for the decision-making process of the MARL model.
”.
- We conducted a comparative analysis of the MAPPO algorithm and the MADDPG method in Section 5.3 titled “Path Planning Results”. Additionally, we performed an experiment comparing MAPPO with the MATD3 method. The average path length and average steering angles on the paths in the Monte Carlo simulations were also provided. The details of our findings are presented as follows.
“
We trained our planner for a total of 1e+6 and 6e+6 iterations, respectively. We evaluated the network and recorded metric values every 5000 training iterations, and the evaluation reward curves are shown in Fig.11. The action space dimension is 7 and the observation space dimension is 24. Figures 12 and 13 depict the training reward curves of the MADDPG and Multi-Agent TD3 (MATD3) algorithms. MATD3 improves upon the original DDPG algorithm by incorporating additional techniques, such as twin critic networks and delayed policy updates, to improve coordination and learning in complex environments. Due to the complexity of the training process and the long duration required for each iteration, we trained both the MADDPG and MATD3 algorithms for 1e+6 iterations.
We observed that the reward curve of our method converges as the training iterations approach 1e+6. The oscillation is relatively low, indicating that the method has relatively high stability. Fig.11b shows the training reward curve for 6e+6 training iterations. The rewards decrease as the training iterations increase, indicating the method scales well with a large number of training iterations.
The curve became steady as the number of iterations approached 5e+6, which probably means the model was sufficiently trained. Since training efficiency is also an important metric for the online planner model, we benchmarked our method against the state-of-the-art MADDPG and MATD3 models after training for 1e+6 iterations. The oscillations in the reward curves of MADDPG and MATD3 are significant, and the decreases in rewards are not obvious. This observation possibly implies that MADDPG and MATD3 require more training iterations than our method.
Meanwhile, the duration of training for our MAPPO-based method with 6e+6 iterations was less than 5 hours, whereas the training durations for MADDPG and MATD3 with only 1e+6 iterations exceeded 10 hours. The above observations demonstrate that our method has much higher training efficiency than MADDPG and MATD3 due to the low complexity of MAPPO.
The results of the Monte Carlo simulations show that our MAPPO algorithm outperforms both the MADDPG and MATD3 algorithms in terms of average path length and steering angle. Specifically, the average path length of our MAPPO algorithm is 0.489, which is lower than that of the MADDPG and MATD3 algorithms (0.683 and 0.582, respectively). This indicates that our algorithm is able to plan more optimal paths. Moreover, the average steering angle of our MAPPO algorithm is 2.14 rad, which is lower than that of the MADDPG and MATD3 algorithms (2.35 rad and 2.23 rad, respectively). This suggests that our algorithm is also more efficient in steering the vehicle. Overall, these observations provide evidence that our method is capable of planning more optimal paths than the MADDPG and MATD3 models after being trained for 1e+6 iterations in the simulation environment of the Monte Carlo simulations. This result highlights the potential of our algorithm to improve the performance of autonomous navigation systems.
Fig.11. Reward curves of MAPPO-based planner. (a) reward curve of training for 1e+6 iterations; (b) reward curve of training for 6e+6 iterations.
Fig.12. Reward curve of MADDPG
Fig.13. Reward curve of MATD3
”
Comment 6.
What kind of software do you use to perform the simulation?
Response. Thank you for your insightful question. We used PyTorch to build the neural networks, and we modified the OpenAI Gym environment to construct our simulation environment and realize the visual display. Gym provides a collection of environments and tools for developing and comparing reinforcement learning algorithms, and it offers a standardized interface for interacting with various simulation environments. We used Matplotlib to draw the figures.
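For reference, the kind of Gym-style interface our simulation builds on looks roughly like the sketch below; the class name, the random placeholder dynamics, and the per-agent lists are illustrative only, and the actual multi-USV environment in the paper is considerably more detailed.

```python
import gym
import numpy as np
from gym import spaces

class USVPlanningEnv(gym.Env):
    """Skeleton of a Gym-style environment for multi-USV path planning."""

    def __init__(self, n_agents: int = 3):
        super().__init__()
        self.n_agents = n_agents
        # 7 discrete candidate actions and a 24-dimensional observation per USV.
        self.action_space = spaces.Discrete(7)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(24,), dtype=np.float32)

    def reset(self):
        # Randomly place USVs, obstacles and TSs, then return initial observations.
        return [self.observation_space.sample() for _ in range(self.n_agents)]

    def step(self, actions):
        # Advance the USV and TS kinematics, compute rewards and termination flags.
        obs = [self.observation_space.sample() for _ in range(self.n_agents)]
        rewards = [0.0] * self.n_agents
        dones = [False] * self.n_agents
        return obs, rewards, dones, {}

    def render(self, mode="human"):
        pass  # visual display handled via the modified Gym rendering and Matplotlib
```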
Comment 7.
The conclusion must be improved and it is better to mention it as a point after some brief description of the work.
Response. Thank you for your valuable comment. We greatly appreciate your comment as it has helped us enhance the quality of the conclusion. In accordance with your suggestion, we have revised the conclusion to include a brief description of the work followed by the mentioned conclusion point.
“
This research proposes a two-stage path planning method for multiple USVs based on COLREGs. The method combines a cooperation module, a COLREGs-compliant action evaluation module, and a TS detection module. The cooperative path planning model for collision avoidance among USVs is constructed based on the MAPPO strategy, which utilizes a policy network capable of handling multiple aggregation goals, obstacles, and TSs. To achieve this, we define the action space, observation space, and reward function for the policy network, and design actor and critic networks.
- Monte Carlo experimental results confirm the effectiveness and efficiency of our path planning method for formation aggregation and collision avoidance. We conducted these experiments by randomly specifying the positions of USVs and obstacles. This approach allowed us to evaluate the performance of our method in diverse scenarios and validate its robustness.
- We benchmarked the simulation results against the MADDPG and MATD3 methods to validate the efficiency and the optimization performance of our approach.
After training the COLREGs-compliant action evaluation calculation module using TS-avoiding trajectories, USV actions that go against COLREGs can be recognized and suppressed. This judgment is then used as a heuristic for the actor network. Our reward function considers both COLREGs and seamanship principles.
- We conducted further experiments to test the feasibility of our collision avoidance scheme based on COLREGs within the USV fleet, as well as between USVs and TSs. We were able to confirm the practicality and effectiveness of our method in a realistic scenario.
The TS detection network is constructed based on the YOLO network and the squeeze-and-excitation scheme.
- Our proposed TS detection model performs well in our specific environment.
Primarily, we evaluate the algorithm through simulations conducted in Gym environments. Additionally, we have conducted experiments on a semi-physical simulation platform where our algorithm acts as a local path planner, guiding the navigation of the system. The cooperative path planning module is executed on individual ship-borne computers (PC-104) installed on each USV member. VxWorks 6.6 is utilized as the operating system, with Workbench 3.0 as the chosen development package.
In our future experiments, we aim to further advance our research by translating the algorithm into physical USVs. This will allow us to conduct practical implementation and comprehensive testing of the algorithm's capabilities
”.
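Since the conclusion above mentions that the TS detection network combines the YOLO network with the squeeze-and-excitation scheme, a minimal PyTorch sketch of a generic squeeze-and-excitation block is provided below for the reviewer's reference; it follows the standard SE design and is not copied from the exact detector used in the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block: global average pooling ("squeeze")
    followed by a two-layer bottleneck that re-weights the channels ("excitation")."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, C, H, W)
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)             # squeeze to (batch, C)
        w = self.fc(w).view(b, c, 1, 1)         # per-channel weights in (0, 1)
        return x * w                            # re-scale the feature maps

# Example: re-weight a feature map coming out of a YOLO backbone stage.
features = torch.randn(2, 64, 40, 40)
refined = SEBlock(64)(features)
```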
Comment 8.
What is the lab that the experiment has been performed? Please mention and describe.
Response. Thank you for your insightful question.
Primarily, we evaluate the algorithm through simulations conducted in Gym environments.
- We have tested the performance of our planner by Monte Carlo simulations to confirm the effectiveness and efficiency of our path planning method for formation aggregation and collision avoidance. We conducted these experiments by randomly specifying the positions of USVs, obstacles, and TSs. This approach allowed us to evaluate the performance of our method in diverse scenarios and validate its robustness.
- We conducted further simulation experiments to test the feasibility of our collision avoidance scheme based on COLREGs within the USV fleet, as well as between USVs and TSs. We were able to confirm the practicality and effectiveness of our method in a realistic scenario.
- The TS detection network was evaluated on the SeaShips dataset using the 10-fold cross-validation method, with 70% of the images used as the training set and the remaining 30% as the test set in each fold. The training set and test set did not intersect.
- In addition to simulations in Gym environments, we have also conducted experiments on our semi-physical simulation platform. In this platform, our algorithm serves as a local path planner for guiding the navigation of the system.
Fig. 1 illustrates the architecture of the platform, where submodules establish communication with the central control module using the Transmission Control Protocol/Internet Protocol (TCP/IP), and the communication thread is realized through socket programming technology.
To ensure synchronized operation, we have implemented a system timer that interrupts modules at fixed time intervals within each communication iteration. This synchronization is further upheld by aligning the clocks in submodules with the system clock.
In order to uphold the accuracy and reliability of communication, specific communication rules have been established between the central controller and the functional modules.
Fig.1. Architecture of the USVs online navigation system
The cooperative path planning module is executed on individual ship-borne computers (PC-104) installed on each USV member. The operating system used is VxWorks 6.6, and the development package utilized is Workbench 3.0.
The execution procedure is visualized using the OpenGL Library and the Vega software.
In our future experiments, we plan to translate the algorithm into physical USVs for practical implementation and testing.
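To give a more concrete picture of the TCP/IP communication between the submodules and the central control module described above, a minimal sketch of a submodule-side socket client is shown below; the host address, port, message format, and fixed interval are placeholders rather than the platform's actual protocol.

```python
import json
import socket
import time

CENTRAL_HOST, CENTRAL_PORT = "192.168.1.10", 9000   # placeholder address of the central controller
PERIOD_S = 0.1                                       # placeholder fixed communication interval

def run_submodule(module_name: str):
    """Connect a functional submodule to the central control module and exchange
    one status/command message per fixed time interval."""
    with socket.create_connection((CENTRAL_HOST, CENTRAL_PORT)) as sock:
        while True:
            status = {"module": module_name, "timestamp": time.time()}
            sock.sendall((json.dumps(status) + "\n").encode())
            command = sock.recv(4096)        # blocking read of the controller's reply
            # ... dispatch `command` to the local planner/controller here ...
            time.sleep(PERIOD_S)             # stand-in for the system-timer interrupt
```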
According to the comment, we mentioned and described our lab in the Conclusion section as follows.
“
Primarily, we evaluate the algorithm through simulations conducted in Gym environments. Additionally, we have conducted experiments on a semi-physical simulation platform where our algorithm acts as a local path planner, guiding the navigation of the system. The cooperative path planning module is executed on individual ship-borne computers (PC-104) installed on each USV member. VxWorks 6.6 is utilized as the operating system, with Workbench 3.0 as the chosen development package.
In our future experiments, we aim to further advance our research by translating the algorithm into physical USVs. This will allow us to conduct practical implementation and comprehensive testing of the algorithm's capabilities
”.
We also added the following sentences on the TS experimental environment in Section 5.5, titled “TS Detection Results”.
“
In the evaluation of the TS detection network on the SeaShips dataset, the 10-fold cross-validation method was employed. This method involves dividing the dataset into 10 subsets or folds of approximately equal size. For each fold, 70% of the images were used as the training set, while the remaining 30% of images were used as the test set. It is important to note that the training set and the test set did not have any overlapping images
”.
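The evaluation protocol quoted above (10 folds, each with a disjoint 70%/30% train/test split of the SeaShips images) can be illustrated with a repeated random split; the sketch below uses scikit-learn's ShuffleSplit, and the file list and helper calls are placeholders.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

image_paths = np.array([f"seaships/{i:06d}.jpg" for i in range(7000)])  # placeholder file list

# 10 random folds, each using 70% of the images for training and the remaining
# 30% for testing; train and test never overlap within a fold.
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

for fold, (train_idx, test_idx) in enumerate(splitter.split(image_paths)):
    assert set(train_idx).isdisjoint(test_idx)       # no overlapping images in a fold
    train_set, test_set = image_paths[train_idx], image_paths[test_idx]
    # train_detector(train_set); evaluate_detector(test_set)   # placeholder helpers
    print(f"fold {fold}: {len(train_set)} train / {len(test_set)} test images")
```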
Comment 9.
Please follow the format of journal in writing the refs.
Response. Thank you for bringing the formatting issue to our attention. We apologize for any inconvenience caused. Upon receiving your comment, we have thoroughly reviewed and corrected the format of all the references, including reference [6], to ensure they align with the appropriate journal format. We appreciate your understanding and attention to detail in helping us maintain the accuracy and professionalism of our work.
The revised references are listed below.
“
- Campbell, S.; Naeem, W.; Irwin, G.W. A review on improving the autonomy of unmanned surface vehicles through intelligent collision avoidance manoeuvres. Annual Reviews in Control. 2012, 36(2), 267-283.
- Chakravarthy, A.; Ghose, D. Obstacle Avoidance in a Dynamic Environment: A Collision Cone Approach. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, 1998, 28(5), 562-574.
- Liang, X.; Qu, X.; Wang, N.; Li, Y.; Zhang, R. Swarm control with collision avoidance for multiple underactuated surface vehicles. Ocean Engineering. 2019, 191, 106516.
- Liang, X.; Qu, X.; Wang, N.; Li, Y.; Zhang, R. A Novel Distributed and Self-Organized Swarm Control Framework for Underactuated Unmanned Marine Vehicles. IEEE Access. 2019, 7, 112703-112712.
- Xia, J.; Luo, Y.; Liu, Z.; Zhang, Y.; Shi, H.; Liu, Z. Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning. Defence Technology, 2022, 29, 80-94.
- Xue, D.; Wu, D.; Yamashita, A. S.; Li, Z. Proximal policy optimization with reciprocal velocity obstacle based collision avoidance path planning for multi-unmanned surface vehicles. Ocean Engineering, 2023, 273, 114005.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Yang, Y.; Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint, 2020.
- Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B. Deep reinforcement learning: a survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, PMLR, 2015, 1889-1897.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv preprint, 2017, arXiv:1707.06347.
- Lowe, R.; Wu, Y. I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 2017, 30.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, PMLR, 2018, 1587-1596.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint, 2015, arXiv:1509.02971.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
- Rongcai, Z.; Hongwei, X.; Kexin, Y. Autonomous collision avoidance system in a multi-ship environment based on proximal policy optimization method. Ocean Engineering, 2023, 272, 113779.
- Skrynnik, A.; Yakovleva, A.; Davydov, V.; Yakovlev, K.; Panov, A. I. Hybrid policy learning for multi-agent pathfinding. IEEE Access, 2021, 9, 126034-126047.
- Wang, K.; Kang, B.; Shao, J.; Feng, J. Improving generalization in reinforcement learning with mixture regularization. Advances in Neural Information Processing Systems, 2020, 33, 7968-7978.
- Khoi, N.D.H.; Van, C.P.; Tran, H.V.; Truong, C.D. Multi-Objective Exploration for Proximal Policy Optimization.In 2020 Applying New Technology in Green Buildings (ATiGB). IEEE, 2021, 105-109.
- Tam, C.K.; Richard, B. Collision risk assessment for ships. Journal of Marine Science and Technology, 2010, 15, 257-270.
- Statheros, T.; Howells, G.; Maier, K.M. Autonomous ship collision avoidance navigation concepts, technologies and techniques. The Journal of Navigation, 2008, 61, 129-142.
- Wen, N.; Zhao, L.; Zhang, R.B.; Wang, S.; Liu, G.; Wu, J.; Wang, L. Online paths planning method for unmanned surface vehicles based on rapidly exploring random tree and a cooperative potential field. International journal of advanced robotic systems, 2022, 19(2), 1-22.
- Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. SeaShips: A large-scale precisely annotated dataset for ship detection. IEEE Transactions on Multimedia, 2018, 20(10), 2593-2604.
- Kim, J.H.; Kim, N.; Park, Y.W.; Won, C.S. Object detection and classification based on YOLO-V5 with improved maritime dataset. Journal of Marine Science and Engineering, 2022, 10(3), 377.
”
Comment 10.
How did you collect the data, what kind of sensors you use?
Response. Thank you for your insightful question.
We trained our planner for a total of 1e+6 and 6e+6 iterations, respectively. During training, we evaluated the network and recorded metric values every 5000 iterations. This included saving the algorithm performance metrics, the status information of the USVs and TSs, and the obstacle information in the .npz file format. To highlight this, we rewrote the fourth paragraph of Section 5.3, titled “Path Planning Results”, as follows.
“
We trained our planner for a total of 1e+6 and 6e+6 iterations, respectively. We evaluated the network and recorded metric values every 5000 training iterations
”.
Vision sensors were used to collect real-time images of TSs, and simulated USV control data from Simulink models were used to observe the status of the USVs during the simulation.
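A minimal sketch of the periodic .npz logging described above is given below; the array names and the shapes indicated in the comments are illustrative assumptions rather than the exact record layout used in our code.

```python
import numpy as np

EVAL_INTERVAL = 5000   # evaluate and log every 5000 training iterations

def log_evaluation(iteration, metrics, usv_states, ts_states, obstacles):
    """Save the performance metrics and scene information for one evaluation point."""
    np.savez(f"eval_{iteration:08d}.npz",
             iteration=iteration,
             reward=metrics["reward"],
             path_length=metrics["path_length"],
             steering_angle=metrics["steering_angle"],
             usv_states=usv_states,        # e.g. (n_usvs, T, state_dim) trajectories
             ts_states=ts_states,          # target-ship trajectories
             obstacles=obstacles)          # obstacle positions/shapes

# Loading a record back for analysis and plotting with Matplotlib:
# data = np.load("eval_00005000.npz"); print(data["reward"])
```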
Comment 11.
How did you ensure the accuracy of the results?
Response. Thank you for your valuable question. To ensure the accuracy of the results in our simulations, we employed several validation and verification techniques, as follows.
- We benchmarked the simulation results against the MADDPG and MATD3 methods to validate the optimization performance of our approach in Section 5.3 titled “Path Planning Results”.
- We re-conducted the experiments of USVs avoiding TSs in accordance with the COLREGs in Section 5.4, titled “COLREGs-based Collision Avoidance Experiments”.
Our objective was to evaluate the qualitative results of the USVs' behavior in scenarios involving the avoidance of TSs. We invited domain experts to analyze our findings. These experts were able to provide valuable insights into the performance of the planner and its compliance with the COLREGs.
The rewritten sections are as follows.
“Fig. 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both USVs and obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach aggregation targets, considering both low-cost paths and collision avoidance among USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Fig. 14a and b illustrate two representative scenarios of avoiding collisions in accordance with COLREGs. USVs typically prioritize the most significant collision risks in the nearby area and subsequent collision avoidance issues, taking actions according to COLREGs to avoid collision. In Fig. 14a, the first USV not only overtakes the second USV by turning to the starboard side in accordance with COLREGs, but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side when heading on a TS.
In the scenario illustrated in Fig. 14b, the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Fig. 14 demonstrates that our path planning method is capable of effectively planning multiple USV paths to achieve targets while ensuring collision avoidance among USVs and TSs in accordance with COLREGs regulations.
Fig.14. Simulations of USVs avoiding TSs in accordance with COLREGs
”
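To clarify how the give-way/stand-on behaviour shown in Fig. 14 can be checked, the sketch below classifies a two-vessel encounter from the relative bearing of the TS and the difference between the two headings. The angular thresholds are common textbook values given for illustration only; they are not the exact rules encoded in our planner.

```python
import math

def wrap_deg(a: float) -> float:
    """Wrap an angle to (-180, 180] degrees."""
    return (a + 180.0) % 360.0 - 180.0

def classify_encounter(own_heading: float, ts_heading: float, bearing_to_ts: float) -> str:
    """Rough COLREGs encounter classification (angles in degrees, clockwise from north).

    Illustrative thresholds: head-on if the TS is nearly ahead on a nearly reciprocal
    course; being overtaken if the TS approaches from more than 112.5 deg abaft the
    beam (Rule 13); otherwise crossing (Rules 15-17).
    """
    rel_bearing = wrap_deg(bearing_to_ts - own_heading)   # direction of the TS seen from own ship
    rel_course = wrap_deg(ts_heading - own_heading)       # difference between the two headings

    if abs(rel_bearing) <= 10.0 and abs(abs(rel_course) - 180.0) <= 10.0:
        return "head-on (Rule 14): both alter course to starboard"
    if abs(rel_bearing) > 112.5:
        return "being overtaken (Rule 13): stand on"
    if 0.0 < rel_bearing <= 112.5:
        return "crossing, TS on starboard side (Rule 15): give way"
    return "crossing, TS on port side: stand on"

print(classify_encounter(own_heading=0.0, ts_heading=270.0, bearing_to_ts=45.0))
```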
- In Section 5.5 titled “TS Detection Results”, we added the following sentences.
“
In the evaluation of the TS detection network on the SeaShips dataset, the 10-fold cross-validation method was employed. This method involves dividing the dataset into 10 subsets or folds of approximately equal size. For each fold, 70% of the images were used as the training set, while the remaining 30% of the images were used as the test set. It is important to note that the training set and the test set did not have any overlapping images
”.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
There has been a lot of recent research on automatic collision avoidance. Nonetheless, the contribution and novelty of this study should be emphasized. The important comments are as follows.
1. As revealed in the author's previous papers and many recent studies, many papers on USV avoidance movements are being published. A statement is needed that highlights the biggest differences between this paper and previous papers studied.
2. It was said that the USV action selection module will be developed according to COLREGs. For applications at sea, it is very important that COLREGs are used. You can check this in Section 2.3. The rules are referenced there, but they should be explicitly re-written in this paper.
The reasons are 1) Maza, J. A. G., & Argüelles, R. P. (2022). COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 261, 112029.
2) Kim, J. K., & Park, D. J. (2024). Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 159, 105894.
As stated in these papers, please refer to them and supplement the manuscript.
3. In the COLREGs-based collision avoidance experiments of Section 5.4, were the head-on, crossing, and overtaking situations based on COLREGs? The picture is somewhat awkward. After checking which ship is the give-way ship and which is the stand-on ship, you need to check the ship actions.
Author Response
Thank you very much for the careful reading of our manuscript and valuable suggestions. Please find the detailed responses with figures and tables below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files. Based on the comments, we have made modifications on the revised manuscript. Detailed revisions were shown as follows.
Reviewer 2
The changes to our manuscript within the document were also highlighted by using red text.
There has been a lot of recent research on automatic collision avoidance. Nonetheless, the contribution and novelty of this study should be emphasized. And the important comments are as follows.
Comment 1.
As revealed in the author's previous papers and many recent studies, many papers on USV avoidance movements are being published. A statement is needed that highlights the biggest differences between this paper and previous papers studied.
Response. Thank you very much for your constructive comment; it is crucial for enhancing the quality of this paper.
According to the comment, we have carefully revised the literature review to highlight the biggest differences between this paper and previously studied papers, as well as to clearly present the novelty of the work based on that.
The biggest difference between our proposed method and previous Deep Reinforcement Learning (DRL) studies is as follows: while the majority of current DRL methods concentrate on generating COLREGs-compliant paths using a COLREGs-based reward function, the high randomness in early-stage action selection can result in unpredictable policy gradient updates, posing challenges for achieving model convergence. In response to this, we have introduced a dual COLREGs-compliant action selection strategy to effectively address the vessel avoidance problem by fully utilizing seamanship.
Firstly, we have developed a COLREGs-compliant action evaluation network that employs a deep learning model trained on pre-recorded TS avoidance trajectories by USVs in adherence to COLREGs. Subsequently, we have proposed a COLREGs-compliant reward function-based action selection network that takes into account various TS encountering scenarios. As a result, the outputs of the two networks are fused to enable the selection of actions for cooperative path planning processes.
To highlight the differences between our method and previous DRL methods, we have revised the whole abstract of our manuscript to provide a clear overview of our study, the results, and the contributions.
The rewritten abstract is presented below.
“Abstract: This research introduces a two-stage deep reinforcement learning approach for the cooperative path planning of Unmanned Surface Vehicles (USVs). The method is designed to address cooperative collision avoidance path planning while adhering to the International Regulations for Preventing Collisions at Sea (COLREGs) and considering the collision avoidance problem within the USV fleet and between USVs and Target Ships (TSs). To achieve this, the study presents a dual COLREGs-compliant action selection strategy to effectively manage the vessel avoidance problem. Firstly, we construct a COLREGs-compliant action evaluation network that utilizes a deep learning network trained on pre-recorded TS avoidance trajectories by USVs in compliance with COLREGs. Then, the COLREGs-compliant reward function-based action selection network is proposed by considering various TS encountering scenarios. Consequently, the results of the two networks are fused to select actions for cooperative path planning processes. The path planning model is established using the Multi-Agent Proximal Policy Optimization (MAPPO) method. The action space, observation space, and reward function are tailored for the policy network. Additionally, a TS detection method is introduced to detect the motion intentions of TSs. The study conducted Monte Carlo simulations to demonstrate the strong performance of the planning method. Furthermore, experiments focusing on COLREGs-based TS avoidance were carried out to validate the feasibility of the approach. The proposed TS detection model exhibited robust performance within the defined task”.
- To provide the motivations for our study, we have analyzed the existing unresolved difficult issues or problems in the field of COLREGs-based path planning. This distinguishes our study from previous research.
“
The majority of current DRL methods focus on generating COLREGs-compliant paths using a COLREGs-based reward function. However, the high randomness in early-stage action selection can lead to unpredictable policy gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring multiple USVs can maintain formation and reach their respective goals [30, 31]. Additionally, being COLREGs-compliant does not guarantee an ideal evasive behavior as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectory data and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure USV navigation adheres to COLREGs rules but also replicate the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments poses a complex problem due to the need to plan multiple paths simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner is efficient. MAPPO offers high efficiency in learning, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task
”.
- Subsequently, we explicitly outlined the novelties of our research as follows:
“Subsequently, this research seeks to propose a two-stage Multi-Agent Reinforcement Learning (MARL) scheme based on the MAPPO algorithm, incorporating a centralized training and decentralized execution strategy. The following innovations are presented:
- We introduce a COLREGs-compliant action evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and candidate action vector from the actor network, we select an action that is most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
- We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
- A TS motion detection network is constructed to provide guidance for the decision-making process of the MARL model.”.
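To make the fusion step described in the first innovation above more concrete for the reviewer, we provide a minimal illustrative sketch below. The function name, the element-wise product fusion rule, and the example probability values are our own simplifications for illustration only; the actual fusion operator and action-space size used in the manuscript may differ.

```python
import numpy as np

def fuse_action_distributions(actor_probs, colregs_probs, eps=1e-8):
    """Fuse the actor's candidate-action distribution with a
    COLREGs-compliance probability vector (illustrative sketch only).

    Both inputs are 1-D arrays over the same discrete action space.
    The element-wise product up-weights actions that are both
    high-value and COLREGs-compliant; the result is re-normalized
    into a valid probability distribution.
    """
    fused = actor_probs * colregs_probs + eps   # eps avoids an all-zero vector
    return fused / fused.sum()                  # re-normalize

# Hypothetical example with a 7-dimensional action space
actor_probs = np.array([0.05, 0.10, 0.30, 0.25, 0.15, 0.10, 0.05])    # from the actor network
colregs_probs = np.array([0.01, 0.04, 0.10, 0.25, 0.30, 0.20, 0.10])  # from the evaluation network
fused = fuse_action_distributions(actor_probs, colregs_probs)
selected_action = int(np.argmax(fused))  # greedy selection, for illustration
```

In this simplified form, the action that is both favored by the actor and consistent with COLREGs receives the highest fused probability.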
- To further substantiate our contributions, we have incorporated the latest advancements in COLREGs-based path planning to enrich our literature review.
The newly added references are listed below.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR, 2018, 1587-1596.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
- We have added a review of the advancements in interpreting and implementing the COLREGs rules, as follows.
“The development of International Regulations for Preventing Collisions at Sea-compliant (COLREGs-compliant) navigation has proceeded along two dimensions: the complexity of ship encounter scenarios and the evolution of methodologies.
The studies [7], [8], and [9] thoroughly analyzed and interpreted various scenarios in which Unmanned Surface Vehicles (USVs) may encounter other vessels during their navigation, as well as the COLREGs rules that all vessels must follow to avoid collision. The research [7] provides valuable insights into ship collision avoidance based on COLREG 72, which can be useful for Officers Of the Navigational Watch (OONW) both onboard and remotely, as well as for autonomous systems. The research is supported by examples drawn from various works that, despite their significant influence in current literature, may not have a correct standpoint. The study [8] contributes to providing an understanding of COLREGs sailing rules based on the insights of navigators and researchers. In the study [9], the recent progress in COLREGs-compliant navigation of USVs from traditional to learning-based approaches is reviewed in depth.
Four main classes of conventional methods have been developed to accommodate COLREGs rules in path planning modules [9].
Rule-based methods rely on hand-crafted design and focus on simple ship-encounter scenarios, and hence are hard to extend to more complex ship encounter scenarios. Hybrid methods, such as A*-variants and Rapidly-exploring Random Tree-variants, suffer from the complexity of the multi-TS avoidance problem and the high dimension of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also hard to apply in complex TS encounter scenarios.
Additionally, traditional research has usually focused on simple one-on-one ship encounter scenarios dictated by rules 13–16 in Part B of COLREGs. However, the USV path planning problem becomes more challenging when considering multi-TS encounter scenarios. When these algorithms are deployed in simulated or real scenarios, their simplified and basic assumptions about COLREGs fall far short of its real complexity [7, 8].
Therefore, traditional methods are not able to fully exploit the seamanship of experienced mariners in solving complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions”.
- We have added several COLREGs-based collision avoidance methods based on deep reinforcement learning to improve the literature review.
“In the paper [20], a multi-USV automatic collision avoidance method was employed based on a Double Deep Q Network (DDQN) with prioritized experience replay. In the study [21], the PPO algorithm and a hand-crafted reward function are used to encourage the USV to comply with the COLREGs rules.
Sawada et al. [22] proposed a collision avoidance method based on PPO; it uses a grid sensor to quantize obstacle zones by target and uses a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network to control the rudder angle. Xu et al. [23] proposed a COLREGs Intelligent Collision Avoidance (CICA) algorithm, which tracks the current network weight to update the target network weight, improving the stability of learning the optimal strategy.
The study [24] employed a collision avoidance framework that divides all encounter scenarios into seven types according to the avoidance constraints of the COLREGs for different encountered scenes.
In the study [25], the COLREGs and ship maneuverability were considered in the reward for achieving multi-ship automatic collision avoidance, and the Optimal Reciprocal Collision Avoidance (ORCA) algorithm was used to detect and reduce the risk of collision”.
- We have incorporated the following sentences to give the reader a clear understanding of the deep reinforcement learning method.
“Different from traditional methods, as a paradigm in the field of Machine Learning (ML) to achieve multi-agent collaboration, the Deep Reinforcement Learning (DRL) model mainly studies the synchronous learning and evolution of agent strategy that is used to plan the coordinated movement of formation in real time [10]. By interacting with the environment, it continuously optimizes the agent's action strategy. The value function of different action strategies in the current state is estimated, and high-return actions are executed to avoid performing low-return or punitive actions. The deep network module of DRL is utilized to fit the motion model of USVs, enabling smooth control and avoiding falling into the local optimal solution [10]. DRL is well-suited for situations where the optimal decision-making strategy is not known beforehand and for dealing with non-stationary environments where the underlying dynamics may change over time [10]. These characteristics make it an effective and powerful tool for solving our task.
DRL methods can be divided into three categories [11]. The Value-based methods, such as DQN-variants, estimate the optimal values of all different states, then derive the optimal policy using the estimated values [12, 13]. The policy-based methods, such as Trust Region Policy Optimization (TRPO) [14] and Proximal Policy Optimization (PPO) [15], optimize the policy directly without maintaining the value functions. The actor–critic methods, such as Deep Deterministic Policy Gradient (DDPG) [16], Twin Delayed DDPG (TD3) [17] and Soft Actor-Critic (SAC) [18], can be viewed as a combination of the above two methods, and they maintain an explicit representation of both the policy (the actor) and the value estimates (the critic)”.
- We have revised the description of the fundamental DRL methods.
“TRPO, which builds on the vanilla policy gradient method, ensures that each update to the policy parameters stays within a trust region to prevent large parameter updates that might degrade the policy performance, and it can handle high-dimensional state spaces [14]. However, it is relatively complicated, is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks), and has poor data efficiency. To address these problems, the Proximal Policy Optimization (PPO) method emerged. PPO converts the constraint term of TRPO into a penalty term to decrease the complexity of constrained optimization, and it uses only first-order optimization [15]. Multi-Agent PPO (MAPPO) is a version for the multi-agent partially observable Markov decision process [19]. The Multi-Agent DDPG (MADDPG) method is another state-of-the-art method besides MAPPO. In MADDPG, each agent treats the other agents as part of the environment, and agents in the same region cooperate with each other to determine the optimal coordinated action [16]. However, it is hard to achieve stability due to the sensitivity of its hyperparameters”.
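As a compact illustration of the PPO mechanism discussed above, the clipped surrogate loss that replaces TRPO's trust-region constraint can be sketched as follows. This is the standard textbook formulation following [15], written here in PyTorch for concreteness; it is not the exact training code of our planner.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss used by PPO/MAPPO (standard formulation).

    PPO clips the probability ratio between the new and old policies,
    so only first-order optimization is needed instead of TRPO's
    constrained trust-region update.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # maximize the surrogate objective -> minimize its negative mean
    return -torch.min(unclipped, clipped).mean()
```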
- We have included a table to demonstrate the advantages and limitations of classic path re-planning methods and DRL methods.
“
Tbl.1. Introduction to comparative methods

| Category | Method | Advantage | Limitation |
| --- | --- | --- | --- |
| Path re-planning methods | Rule-based methods | COLREGs rules integrated into path re-planning | relies on hand-crafted design; hard to extend to complex ship encounter scenarios |
| Path re-planning methods | Hybrid methods: A* | fast; COLREGs compliance incorporated into path re-planning | hard to extend to more complex ship encounter scenarios |
| Path re-planning methods | Reactive methods: velocity obstacle | fast; COLREGs compliance enforced by integrating forbidden zones | accurate TS course information required |
| Path re-planning methods | Optimization-based methods | optimal; COLREGs rules naturally formulated as constraints | relatively high computation burden |
| DRL methods | Value-based methods: Q-learning, DQN | optimal policy derived from estimates of the optimal values of all states | overestimation bias; high dimensionality; hard to balance exploration and exploitation |
| DRL methods | Policy-based methods: MAPPO, TRPO | directly optimize the policy without maintaining value functions | careful design of reward functions required |
| DRL methods | Actor–critic methods: MADDPG, MATD3 | explicit representation of both the policy and the value estimates | computationally expensive; hyperparameter sensitivity |
”
Comment 2.
It was said that the USV action selection module will be developed according to COLREGs. For applications at sea, it is very important that COLREGs are used. You can check this in Section 2.3. This is referenced, but it should be explicitly re-written in this paper.
The reasons are 1) Maza, J. A. G., & Argüelles, R. P. (2022). COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 261, 112029.
2) Kim, J. K., & Park, D. J. (2024). Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 159, 105894.
Please refer to these papers and supplement the manuscript accordingly.
Response. Thank you very much for your constructive comment. We fully agree with your point of view that it is very important that COLREGs are used for applications at sea.
Based on your feedback, we have thoroughly reviewed Section 2.3 titled "COLREGs rules for Collision Avoidance" and have made revisions accordingly. We have added explicit details regarding the interpretation of the COLREGs rules that we utilized. Additionally, we have included information about the recommended reference and incorporated new references that are relevant to COLREGs-based path planning.
- The references that we have recently added are listed below, and the recommended references are highlighted.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR, 2018, 1587-1596.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
- The revised parts in Section 2.3, titled "COLREGs Rules for Collision Avoidance," are presented below.
“A USV may encounter situations of head-on, crossing and overtaking, as shown in Fig.2.
Referring to the studies [7], [8] and [9], the applicable COLREGs rules for this research are as follows:
Rule 14, head-on. When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision each shall alter its course to starboard so that each shall pass on the port side of the other.
Rule 15, crossing situation. USV has the option to either stand on or give way to the TS. When two power-driven vessels are crossing so as to involve risk of collision, the vessel, which has the other on its own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.
Rule 16, action by give-way vessel. Every vessel, which is directed by these rules to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.
Rule 17, action by stand-on vessel. (i) where one of two vessels is to keep out of the way the other shall keep her course and speed. (ii) the latter vessel may however take action to avoid collision by its maneuver alone, as soon as it becomes apparent to it that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules. When, from any cause, the vessel required to keep her course and speed finds itself so close that collision cannot be avoided by the action of the give-way vessel alone, it shall take such action as will best aid to avoid collision.
Rule 13, overtaking. The COLREGs state that “any vessel overtaking any other shall keep out of the way of the vessel being overtaken”. This description stipulates that it is the responsibility of the overtaking ship to avoid a collision, but it gives no clarity as to what action that ship should take. Therefore, in the overtaking situation, we do not define a specific reward function but directly use the reward functions in the base layer to evaluate the avoidance actions of the USV.
Most recent research has focused on addressing more complex scenarios, such as areas with restricted visibility that contain obstructions and busy narrow channels governed by traffic separation schemes. These scenarios involve interactions with vessels that may not comply with COLREGs. For these scenarios, rules 2(b), 8, and 17 should be considered [9].
Rule 2(b), responsibility: under special circumstances, a departure from the rules may be made to avoid immediate danger.
Rule 8, actions to avoid collision: actions shall be made in ample time. If there is sufficient sea-room, alteration of course alone may be most effective. Reduce speed, stop or reverse if necessary. Action by a ship is required if there is a risk of collision, and when the ship has right-of-way.
Rule 17, actions by stand-on vessel: where one of two vessels is to keep out of the way, the other shall keep her course and speed. The latter vessel may however take action to avoid collision by her maneuver alone, as soon as it becomes apparent to her that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules.
In Fig.2, the yellow area represents the situation where the ship is crossing from the port side of other ships, the blue area represents the starboard crossing situation, the red area represents the head-on situation, and the green area represents the over-taking situation where USV is being overtaken. Fig.2 a, b, c, d illustrate the actions that each vessel should take according to the COLREGs.
Fig.2. Illustrations of the COLREGs
Since encounters are dynamic situations, continuous monitoring is required. We assume that USVs remain vigilant of other vessels. If another vessel (including TSs or other USVs) does not comply with COLREGs, USV should make collision-avoidance actions in time”.
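For clarity, the sector-based interpretation of Fig.2 can be sketched as a simple classification routine. We stress that this is a rough illustration rather than the module implemented in the manuscript: the angular thresholds below (a 6-degree head-on cone and the 112.5-degree abaft-the-beam limit associated with Rule 13) are common illustrative choices, not necessarily the exact values used in our reward design.

```python
def classify_encounter(rel_bearing_deg, heading_diff_deg):
    """Coarse encounter classification from the USV's perspective.

    rel_bearing_deg  : bearing of the TS relative to the USV bow,
                       positive to starboard.
    heading_diff_deg : TS course minus USV course.
    Both angles are wrapped to [-180, 180) below. The thresholds are
    illustrative placeholders, not the values used in the manuscript.
    """
    b = ((rel_bearing_deg + 180.0) % 360.0) - 180.0
    h = ((heading_diff_deg + 180.0) % 360.0) - 180.0

    if abs(b) <= 6.0 and abs(h) >= 174.0:
        # nearly reciprocal courses with the TS almost dead ahead
        return "head-on (Rule 14): alter course to starboard"
    if abs(b) > 112.5:
        # TS approaching from more than 22.5 deg abaft the beam
        return "being overtaken (Rule 13): the overtaking vessel keeps clear"
    if abs(h) <= 67.5:
        # roughly parallel courses with the TS forward of the beam
        return "potential overtaking (Rule 13): the overtaking vessel keeps clear"
    if b > 6.0:
        return "crossing, TS on starboard side (Rules 15/16): give way"
    return "crossing, TS on port side (Rule 17): stand on and monitor the TS"
```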
- We have re-conducted collision avoidance experiments based on COLREGs to test the effectiveness of our proposed method in avoiding TSs in accordance with the COLREGs rules in Section 5.4 “COLREGs based Collision Avoidance Experiments”.
- We have clearly outlined the TSs avoiding scenarios in Fig. 14 (original Fig. 11). Here are the specific details:
“Fig. 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both USVs and obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach aggregation targets, considering both low-cost paths and collision avoidance among USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Fig. 14a and b illustrate two representative scenarios of avoiding collisions in accordance with COLREGs. USVs typically prioritize the most significant collision risks in the nearby area and subsequent collision avoidance issues, taking actions according to COLREGs to avoid collision. In Fig. 14a, the first USV not only overtakes the second USV by turning to the starboard side in accordance with COLREGs, but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side when encountering a TS head-on.
In the scenario illustrated in Fig. 14b, the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Fig. 14 demonstrates that our path planning method is capable of effectively planning multiple USV paths to achieve targets while ensuring collision avoidance among USVs and TSs in accordance with COLREGs regulations.
Fig.14. Simulations of USVs avoiding TSs in accordance with COLREGs
”.
Comment 3.
5.4 In COLREGs based collision avoidance experiments, were head-on, crossing, and overtaking based on COLREGs? The picture is somewhat awkward. After checking what is a give-way ship and what is a stand-on ship, you need to check the ship actions.
Response. Thank you for your insightful comment. It is very important and helpful for enhancing the rigor of our COLREGs-compliant path planning experiment in Section 5.4 titled “COLREGs based Collision Avoidance Experiments”.
We apologize for the low readability of the original COLREGs-based collision avoidance experiment figures in our manuscript. The collision avoidance experiments based on COLREGs should distinctly illustrate the head-on, crossing, and overtaking scenarios.
In response to your comment, we have re-conducted collision avoidance experiments based on COLREGs to evaluate the efficacy of our proposed method in avoiding collisions within the USV fleet and between USVs and TSs, in Section 5.4 titled “COLREGs based Collision Avoidance Experiments”.
We have prominently depicted the TSs avoiding scenarios in Fig. 14 (formerly Fig. 11).
We have meticulously checked the results to ensure that the designations of give-way ship and stand-on ship conform to COLREGs regulations. As a result, we can affirm the correctness of the ship actions in accordance with COLREGs.
Here are the specific details:
- We have re-conducted collision avoidance experiments based on COLREGs to test the effectiveness of our proposed method in avoiding TSs in accordance with the COLREGs rules in Section 5.4 titled “COLREGs based Collision Avoidance Experiments”.
- We have clearly outlined the TSs avoiding scenarios in Fig. 14 (original Fig. 11).
“Fig. 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both USVs and obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach aggregation targets, considering both low-cost paths and collision avoidance among USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Fig. 14a and b illustrate two representative scenarios of avoiding collisions in accordance with COLREGs. USVs typically prioritize the most significant collision risks in the nearby area and subsequent collision avoidance issues, taking actions according to COLREGs to avoid collision. In Fig. 14a, the first USV not only overtakes the second USV by turning to the starboard side in accordance with COLREGs, but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side when encountering a TS head-on.
In the scenario illustrated in Fig. 14b, the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Fig. 14 demonstrates that our path planning method is capable of effectively planning multiple USV paths to achieve targets while ensuring collision avoidance among USVs and TSs in accordance with COLREGs regulations.
Fig.14. Simulations of USVs avoiding TSs in accordance with COLREGs
”.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Detailed remarks:
1. Please list the comparative methods briefly mentioned in the introduction section in a table, specifying the advantages and limitations of the methods, and whether and how the approach proposed in this paper eliminates these limitations.
2. Please compare in details the approach introduced in this paper with the mentioned Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method, that is used for comparative analysis in section 5 (Fig. 9 and 10).
3. Please specify which COLREGs rules are fulfilled by the approach proposed in this paper (e.g. Rule 8, Rule 13,14,15??) and how this is achieved?
4. Please specify the objective of the application of the TS Motion Detection Network, described in section 4. What is the reason to apply this in the path planning problem of USVs?
5. Please specify what COLREGs rules are fulfilled by the solution presented in Fig. 11 (similar question to question no. 3).
6. Please add a diagram showing the relation between the different components of the proposed approach, as listed in the Conclusion section: the cooperation module, the COLREGs based action choosing module and the TS detection module.
7. Please point out the main advantages of the proposed approach and the limitations in the Conclusion section (preferably bullet points)
Comments for author File: Comments.pdf
Comments on the Quality of English Language
Please also consider the minor, mainly editing remarks marked in the attached file.
Author Response
Thank you very much for the careful reading of our manuscript and valuable suggestions. Please find the detailed responses below with figures and tables, and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files. Based on the comments, we have made modifications on the revised manuscript. The changes to our manuscript within the document were also highlighted by using red text.
Response to the highlighted problem in the peer-review-33303472.v1.pdf file.
Response. Thank you for thoroughly reviewing our manuscript. Improving the rigor of the paper is crucial, and we greatly appreciate your attention to detail.
Comment 1. “Please briefly explain the vanilla policy gradient method or add a reference to the paper explaining this approach”.
Response. We apologize for the lack of explanation regarding the Vanilla Policy Gradient (VPG) method and the omission of a reference to the paper explaining this approach in our previous version. In response to the comment, we have now provided a brief explanation of the VPG method and included a reference to the relevant paper:
"The Vanilla Policy Gradient (VPG) method directly optimizes the policy by utilizing the gradient of the expected reward, which can lead to slow convergence [14]".
Additionally, to address the issue of slow convergence in VPG, we naturally employed the Trust Region Policy Optimization (TRPO) method. TRPO ensures that each update to the policy parameters remains within a trusted region, preventing large parameter updates that could potentially degrade the performance of the policy.
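For the reviewer's convenience, the standard textbook forms of the two updates mentioned above can be written as follows (these expressions are not quoted from the manuscript; they are the usual VPG estimator and TRPO constrained update).

```latex
% Vanilla policy gradient estimator
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[
      \sum_{t=0}^{T} \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\, R(\tau)
    \right]

% TRPO: maximize the surrogate objective within a trust region
\max_{\theta}\;
  \mathbb{E}\!\left[
      \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, \hat{A}_t
    \right]
\quad \text{s.t.} \quad
  \mathbb{E}\!\left[
      D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_{\theta}(\cdot \mid s_t)\big)
    \right] \le \delta
```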
Comment 2. Incorrect format of trust region policy optimization (TRPO)
Response. We apologize for the incorrect format of the abbreviation used in our previous version. To correct this, we have updated the format to read "Trust Region Policy Optimization (TRPO)" in our revised manuscript. Additionally, we have proofread the entire paper to avoid similar situations and ensure the formatting is consistent and correct throughout the document.
Comment 3. For original expression “Literature [6] improves”, the reviewer suggests: Maybe instead "Literature", it could be better to use " In the paper [] ... " or "In the work []..."???.
Response. Thank you for your comment, and it is rather helpful to improve the readability of the paper.
According to the comment, we have made the following modifications to the expression.
“
In the work [6], the action space and reward function were improved by incorporating the Reciprocal Velocity Obstacle (RVO) scheme. Gated Recurrent Unit (GRU)-based networks were utilized to directly map the state of varying numbers of surrounding obstacles to corresponding actions.
In the paper [5], the collaborative path planning problem was modelled as a decentralized partially observable Markov decision process, and an observation model, a reward function, and an action space suitable for the MAPPO algorithm were devised for multi-target search tasks. In the research [26], a system was constructed to switch between path following and collision avoidance modes in real time, and the collision hazard was perceived through encounter identification and risk calculation.
In the paper [27], the Q-learning method was applied to optimize the state-action pairs and obtain the initial strategy, and PPO was then used to fine-tune the strategy. In the research [28], agents were trained on a mixture of observations from different training environments, and linearity constraints were imposed on both the observation interpolations and the supervision (e.g., associated reward) interpolations. In the study [29], multiple objectives were simultaneously optimized to enhance PPO
”.
Comment 4. The wrong spell of “CLOREGs” into “COLREGs”
Response. Thank you for your careful reading of our manuscript. We apologize for any spelling errors that may have occurred. The word "CLOREGs" is actually the terminology "COLREGs" which is an abbreviation for the International Regulations for Preventing Collisions at Sea. We have thoroughly proofread the entire document to ensure that no similar spelling error occurs.
Comment 5. The reviewer suggests that: “... Fig. 3. we denote ..." - we beginning with a small letter”
Response. Thank you for your careful reading of our paper, we greatly appreciate your attention to detail. We have corrected the format problem, and we proofread the entire document to ensure that no similar format error occurs. The relevant part is shown below.
“
As illustrated in Fig.4, we denote to be the component of on the direction perpendicular to the line , and denote C to be the intersection of the expected USV path with the safe circle of TS, if USV and TS keep their velocities
”.
Comment 6. The grammar mistakes of “an a USV”, “repetition of avoidance”, “a long with the decrement (decrease???) of the distance between the USV and obstacles”.
Response. We apologize for any grammar mistakes that may have caused difficulties in reading our manuscript. We have thoroughly reviewed and proofread the entire paper to ensure that such mistakes do not exist. The revised section is as follows:
“if the distance dt of a USV to its nearest target”
“The obstacle avoidance reward consists of two parts, which are the static obstacle avoidance reward and the dynamic obstacle avoidance reward”
“The reward is a negative value that decreases exponentially along with the decrease of the distance between the USV and obstacles”.
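To illustrate the revised wording above, an exponentially shaped obstacle-avoidance penalty could look like the minimal sketch below. The constants and the exact functional form are placeholders for illustration; they are not the coefficients used in the manuscript's reward function.

```python
import math

def static_obstacle_penalty(distance, safe_distance=1.0, scale=1.0, k=3.0):
    """Illustrative negative reward that decreases (becomes more negative)
    exponentially as the USV-obstacle distance shrinks, and vanishes
    beyond the safe distance. safe_distance, scale and k are placeholders."""
    if distance >= safe_distance:
        return 0.0
    return -scale * math.exp(-k * distance / safe_distance)
```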
Comment 7. Inconsistency between the citation format used in our manuscript and the requirements of the journal template.
Response. Thank you for bringing the formatting issue to our attention. We apologize for any inconvenience caused. Upon receiving your comment, we have thoroughly reviewed and corrected the format of all the references, to ensure they align with the appropriate journal format. We appreciate your understanding and attention to detail in helping us maintain the accuracy and professionalism of our work. The revised references are listed below.
“
- Campbell, S.; Naeem, W.; Irwin, G.W. A review on improving the autonomy of unmanned surface vehicles through intelligent collision avoidance manoeuvres. Annual Reviews in Control. 2012, 36(2), 267-283.
- Chakravarthy, A.; Ghose, D. Obstacle Avoidance in a Dynamic Environment: A Collision Cone Approach. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, 1998, 28(5), 562-574.
- Liang, X.; Qu, X.; Wang, N.; Li, Y.; Zhang, R. Swarm control with collision avoidance for multiple underactuated surface vehicles. Ocean Engineering. 2019, 191, 106516.
- Liang, X.; Qu, X.; Wang, N.; Li, Y.; Zhang, R. A Novel Distributed and Self-Organized Swarm Control Framework for Underactuated Unmanned Marine Vehicles. IEEE Access. 2019, 7, 112703-112712.
- Xia, J.; Luo, Y.; Liu, Z.; Zhang, Y.; Shi, H.; Liu, Z. Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning. Defence Technology, 2022, 29, 80-94.
- Xue, D.; Wu, D.; Yamashita, A. S.; Li, Z. Proximal policy optimization with reciprocal velocity obstacle based collision avoidance path planning for multi-unmanned surface vehicles. Ocean Engineering, 2023, 273, 114005.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Yang, Y.; Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv, 2020, 2011,
- Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B. Deep reinforcement learning: a survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In International conference on machine learning. PMLR, Lille, France, 2015, 1889-1897.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv, 2017, 1707, 06347.
- Lowe, R.; Wu, Y. I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 2017, 30.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR, 2018, 1587-1596.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
- Rongcai, Z.; Hongwei, X.; Kexin, Y. Autonomous collision avoidance system in a multi-ship environment based on proximal policy optimization method. Ocean Engineering, 2023, 272, 113779.
- Skrynnik, A.; Yakovleva, A.; Davydov, V.; Yakovlev, K.; Panov, A. I. Hybrid policy learning for multi-agent pathfinding. IEEE Access, 2021, 9, 126034-126047.
- Wang, K.; Kang, B.; Shao, J.; Feng, J. Improving generalization in reinforcement learning with mixture regularization. Advances in Neural Information Processing Systems, 2020, 33, 7968-7978.
- Khoi, N.D.H.; Van, C.P.; Tran, H.V.; Truong, C.D. Multi-Objective Exploration for Proximal Policy Optimization. In 2020 Applying New Technology in Green Buildings (ATiGB). IEEE, 2021, 105-109.
- Tam, C.K.; Richard, B. Collision risk assessment for ships. Journal of Marine Science & Technology. 2010; 15, 257-70.
- Statheros, T.; Howells, G.; Maier, K.M. Autonomous ship collision avoidance navigation concepts, technologies and techniques. The Journal of Navigation. 2008; 61, 129-42.
- Wen, N.; Zhao, L.; Zhang, R.B.; Wang, S.; Liu, G.; Wu, J.; Wang, L. Online paths planning method for unmanned surface vehicles based on rapidly exploring random tree and a cooperative potential field. International journal of advanced robotic systems, 2022, 19(2), 1-22.
- Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. SeaShips: A Large-Scale Precisely Annotated Dataset for Ship Detection. IEEE Transactions on Multimedia, 2018, 20(10), 2593-2604.
- Kim, J.H.; Kim, N.; Park, Y.W.; Won, C.S. Object detection and classification based on YOLO-V5 with improved maritime dataset. Journal of Marine Science and Engineering, 2022, 10(3), 377.
”
Comment 1.
Please list the comparative methods briefly mentioned in the Introduction section in a table, specifying the advantages and limitations of the methods, and whether and how the approach proposed in this paper eliminates these limitations.
Response. Thank you for your insightful comment; it is valuable for improving the clarity and coherence of the paper. Based on your comment, we have made the following revisions to our manuscript:
In the Introduction section, we have included a table that briefly lists the comparative methods mentioned. This table now highlights the advantages and limitations of each method, providing readers with a comprehensive overview.
Following the comparative analysis, we have introduced our method as a means to address the limitations of existing approaches. Then, we have outlined the unique contributions and potential benefits of our proposed method at the end of the Introduction section.
These modifications aim to enhance the reasonability and readability of our paper, ensuring that readers can better understand the context and significance of our research. We sincerely appreciate your valuable input and guidance.
- According to your comment, we have clarified the reason for choosing the MAPPO algorithm as our basic planner in the Introduction section, as follows.
“
Tbl.1. Introduction to comparative methods

| Category | Method | Advantage | Limitation |
| --- | --- | --- | --- |
| Path re-planning methods | Rule-based methods | COLREGs rules integrated into path re-planning | relies on hand-crafted design; hard to extend to complex ship encounter scenarios |
| Path re-planning methods | Hybrid methods: A* | fast; COLREGs compliance incorporated into path re-planning | hard to extend to more complex ship encounter scenarios |
| Path re-planning methods | Reactive methods: velocity obstacle | fast; COLREGs compliance enforced by integrating forbidden zones | accurate TS course information required |
| Path re-planning methods | Optimization-based methods | optimal; COLREGs rules naturally formulated as constraints | relatively high computation burden |
| DRL methods | Value-based methods: Q-learning, DQN | optimal policy derived from estimates of the optimal values of all states | overestimation bias; high dimensionality; hard to balance exploration and exploitation |
| DRL methods | Policy-based methods: MAPPO, TRPO | directly optimize the policy without maintaining value functions | careful design of reward functions required |
| DRL methods | Actor–critic methods: MADDPG, MATD3 | explicit representation of both the policy and the value estimates | computationally expensive; hyperparameter sensitivity |
The comparative methods are outlined in Tab.1. The majority of current DRL methods focus on generating COLREGs-compliant paths using a COLREGs-based reward function. However, the high randomness in early-stage action selection can lead to unpredictable strategy gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring multiple USVs can maintain formation and reach their respective goals [30, 31]. Additionally, being COLREGs-compliant does not guarantee an ideal evasive behavior as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectory data and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure USV navigation adheres to COLREGs rules but also replicate the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments poses a complex problem due to the need to plan multiple paths simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner is efficient. MAPPO offers high efficiency in learning, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task
”.
- We also discussed the comparative methods in the Introduction section as follows.
“The development of International Regulations for Preventing Collisions at Sea-compliant (COLREGs-compliant) navigation has proceeded along two dimensions: the complexity of ship encounter scenarios and the evolution of methodologies.
The studies [7], [8], and [9] thoroughly analyzed and interpreted various scenarios in which Unmanned Surface Vehicles (USVs) may encounter other vessels during their navigation, as well as the COLREGs rules that all vessels must follow to avoid collision. The research [7] provides valuable insights into ship collision avoidance based on COLREG 72, which can be useful for Officers Of the Navigational Watch (OONW) both onboard and remotely, as well as for autonomous systems. The research is supported by examples drawn from various works that, despite their significant influence in current literature, may not have a correct standpoint. The study [8] contributes to providing an understanding of COLREGs sailing rules based on the insights of navigators and researchers. In the study [9], the recent progress in COLREGs-compliant navigation of USVs from traditional to learning-based approaches is reviewed in depth.
Four main classes of conventional methods have been developed to accommodate COLREGs rules in path planning modules [9].
Rule-based methods rely on hand-crafted design and focus on simple ship-encounter scenarios, and hence are hard to extend to more complex ship encounter scenarios. Hybrid methods, such as A*-variants and Rapidly-exploring Random Tree-variants, suffer from the complexity of the multi-TS avoidance problem and the high dimension of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also hard to apply in complex TS encounter scenarios.
Additionally, traditional research has usually focused on simple one-on-one ship encounter scenarios dictated by rules 13–16 in Part B of COLREGs. However, the USV path planning problem becomes more challenging when considering multi-TS encounter scenarios. When these algorithms are deployed in simulated or real scenarios, their simplified and basic assumptions about COLREGs fall far short of its real complexity [7, 8].
Therefore, traditional methods are not able to fully exploit the seamanship of experienced mariners in solving complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions”.
“DRL methods can be divided into three categories [11]. The Value-based methods, such as DQN-variants, estimate the optimal values of all different states, then derive the optimal policy using the estimated values [12, 13]. The policy-based methods, such as Trust Region Policy Optimization (TRPO) [14] and Proximal Policy Optimization (PPO) [15], optimize the policy directly without maintaining the value functions. The actor–critic methods, such as Deep Deterministic Policy Gradient (DDPG) [16], Twin Delayed DDPG (TD3) [17] and Soft Actor-Critic (SAC) [18], can be viewed as a combination of the above two methods, and they maintain an explicit representation of both the policy (the actor) and the value estimates (the critic).
The Vanilla Policy Gradient (VPG) method directly optimizes the policy by utilizing the gradient of the expected reward, which can lead to slow convergence [14]. In response to this challenge, TRPO ensures that each update to the policy parameters remains within a trust region to prevent large parameter updates that could potentially degrade the performance of the policy [14]. Additionally, TRPO exhibits the capability to effectively handle high-dimensional state spaces. However, it is relatively complicated, is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks), and has poor data efficiency. To address these problems, the Proximal Policy Optimization (PPO) method emerged. PPO converts the constraint term of TRPO into a penalty term to decrease the complexity of constrained optimization, and it uses only first-order optimization [15]. Multi-Agent PPO (MAPPO) is a version for the multi-agent partially observable Markov decision process [19]. The Multi-Agent DDPG (MADDPG) method is another state-of-the-art method besides MAPPO. In MADDPG, each agent treats the other agents as part of the environment, and agents in the same region cooperate with each other to determine the optimal coordinated action [16]. However, it is hard to achieve stability due to the sensitivity of its hyperparameters”.
- In addition, we have included the most recent references and provided concise descriptions of the studies.
The newly added references are listed below.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR, 2018, 1587-1596.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
The descriptions of the latest studies are as follows.
“In the paper [20], a multi-USV automatic collision avoidance method was employed based on a Double Deep Q Network (DDQN) with prioritized experience replay. In the study [21], the PPO algorithm and a hand-crafted reward function are used to encourage the USV to comply with the COLREGs rules.
Sawada et al. [22] proposed a collision avoidance method based on PPO; it uses a grid sensor to quantize obstacle zones by target and uses a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network to control the rudder angle. Xu et al. [23] proposed a COLREGs Intelligent Collision Avoidance (CICA) algorithm, which tracks the current network weight to update the target network weight, improving the stability of learning the optimal strategy.
The study [24] employed a collision avoidance framework that divides all encounter scenarios into seven types according to the avoidance constraints of the COLREGs for different encountered scenes.
In the study [25], the COLREGs and ship maneuverability were considered in the reward for achieving multi-ship automatic collision avoidance, and the Optimal Reciprocal Collision Avoidance (ORCA) algorithm was used to detect and reduce the risk of collision.
In the work [6], the action space and reward function were improved by incorporating the Reciprocal Velocity Obstacle (RVO) scheme. Gated Recurrent Unit (GRU)-based networks were utilized to directly map the state of varying numbers of surrounding obstacles to corresponding actions.
In the paper [5], the collaborative path planning problem was modelled as a decentralized partially observable Markov decision process, and an observation model, a reward function, and an action space suitable for the MAPPO algorithm were devised for multi-target search tasks. In the research [26], a system was constructed to switch between path following and collision avoidance modes in real time, and the collision hazard was perceived through encounter identification and risk calculation.
In the paper [27], the Q-learning method was applied to optimize the state-action pairs and obtain the initial strategy, and PPO was then used to fine-tune the strategy. In the research [28], agents were trained on a mixture of observations from different training environments, and linearity constraints were imposed on both the observation interpolations and the supervision (e.g., associated reward) interpolations. In the study [29], multiple objectives were simultaneously optimized to enhance PPO”.
- Subsequently, we outlined the novelties of our proposed method aiming at eliminating the limitations of comparative methods.
“Subsequently, this research seeks to propose a two-stage Multi-Agent Reinforcement Learning (MARL) scheme based on the MAPPO algorithm, incorporating a centralized training and decentralized execution strategy. The following innovations are presented:
- We introduce a COLREGs-compliant action evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and candidate action vector from the actor network, we select an action that is most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
- We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
- A TS motion detection network is constructed to provide guidance for the decision-making process of the MARL model”.
Comment 2.
Please compare in details the approach introduced in this paper with the mentioned Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method that is used for comparative analysis in section 5 (Fig. 9 and 10).
Response. Thank you for your comment, it is highly valuable for improving the rigor of the paper.
- Based on the comment, we conducted a comparative analysis of the MAPPO algorithm and the MADDPG method.
- Additionally, we performed an experiment comparing MAPPO with the MATD3 method, an improved multi-agent variant that extends MADDPG with twin critic networks and delayed policy updates.
- The average path length and average steering angles on the paths in the Monte Carlo simulations were also provided.
- We rewrote the Section 5.3 titled “Path Planning Results” as follows.
“We trained our planner for a total of 1e+6 and 6e+6 iterations, respectively. We evaluated the network and recorded metric values every 5000 training iterations; the evaluation reward curves are shown in Fig.11. The action space dimension is 7 and the observation space dimension is 24. Figures 12 and 13 depict the training reward curves of the MADDPG and Multi-Agent TD3 (MATD3) algorithms. MATD3 improves upon the original DDPG algorithm by incorporating additional techniques, such as twin critic networks and delayed policy updates, to improve coordination and learning in complex environments. Due to the complexity of the training process and the long duration required for each iteration, we trained both the MADDPG and MATD3 algorithms for 1e+6 iterations.
We observed that the reward curve of our method converges as the training iterations approach 1e+6. The oscillation is relatively low, indicating that the method has relatively high stability. Fig.11b shows the training reward curve for 6e+6 training iterations. The rewards decrease as the training iterations increase, indicating the method scales well with a large number of training iterations.
The curve became steady as the number of iterations approached 5e+6, which probably means the model was sufficiently trained. Since training efficiency is also an important metric for the online planner model, we benchmarked our method against the state-of-the-art MADDPG and MATD3 models after training for 1e+6 iterations. The oscillations in the reward curves of MADDPG and MATD3 are significant, and the decreases in rewards are not obvious. This observation possibly implies that MADDPG and MATD3 may require more training iterations than our method.
Meanwhile, the duration of training for our MAPPO-based method with 6e+6 iterations was less than 5 hours, whereas the training durations for MADDPG and MATD3 with only 1e+6 iterations exceeded 10 hours. The above observations demonstrate that our method has much higher training efficiency than MADDPG and MATD3 due to the low complexity of MAPPO.
The results of the Monte Carlo simulations show that our MAPPO algorithm outperforms both the MADDPG and MATD3 algorithms in terms of average path length and steering angle. Specifically, the average path length of our MAPPO algorithm is 0.489, which is lower than that of the MADDPG and MATD3 algorithms (0.683 and 0.582, respectively). This indicates that our algorithm is able to plan more optimal paths. Moreover, the average steering angle of our MAPPO algorithm is 2.14 rad, which is lower than that of the MADDPG and MATD3 algorithms (2.35 and 2.23 rad, respectively). This suggests that our algorithm is also more efficient in steering the vessel. Overall, these observations provide evidence that our method is capable of planning more optimal paths than the MADDPG and MATD3 models after being trained for 1e+6 iterations in the Monte Carlo simulation environment. This result highlights the potential of our algorithm to improve the performance of autonomous navigation systems.
Fig.11. Reward curves of MAPPO-based planner. (a) reward curve of training for 1e+6 iterations; (b) reward curve of training for 6e+6 iterations.
Fig.12. Reward curve of MADDPG
Fig.13. Reward curve of MATD3
”
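For illustration, the sketch below shows one way the Monte Carlo metrics reported above could be computed from planned waypoint sequences. The manuscript does not spell out the exact metric definitions here, so the segment-sum path length (in normalized environment units) and the accumulated absolute heading change used as the steering angle are assumptions, not the paper's exact implementation.

```python
import numpy as np

def path_metrics(path_xy):
    """Return (path length, cumulative steering angle in rad) for one planned
    path given as an (N, 2) array of waypoints."""
    path = np.asarray(path_xy, dtype=float)
    segments = np.diff(path, axis=0)                      # consecutive displacement vectors
    length = float(np.linalg.norm(segments, axis=1).sum())
    headings = np.arctan2(segments[:, 1], segments[:, 0]) # heading of each segment
    turns = np.diff(headings)
    turns = (turns + np.pi) % (2.0 * np.pi) - np.pi       # wrap turning angles to (-pi, pi]
    steering = float(np.abs(turns).sum())
    return length, steering

def average_metrics(paths):
    """Average path length and steering angle over a set of Monte Carlo runs."""
    lengths, angles = zip(*(path_metrics(p) for p in paths))
    return float(np.mean(lengths)), float(np.mean(angles))
```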
Comment 3.
Please specify which COLREGs rules are fulfilled by the approach proposed in this paper (e.g. Rule 8, Rule 13,14,15??) and how this is achieved?
Response. Thank you for your insightful comment; it is valuable in improving the quality of our paper. In fact, we have fulfilled COLREGs Rules 13, 14, 15, 16, and 17. To address more complex scenarios, we have also considered Rules 2(b), 8, and 17.
To improve the readability of the paper, we have added descriptions of the COLREGs rules that our method mainly considers in section 2.3. The descriptions are as follows:
“A USV may encounter situations of head-on, crossing and overtaking, as shown in Fig.2.
Referring to the studies [7], [8] and [9], the applicable COLREGs rules for this research are as follows:
Rule 14, head-on. When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision each shall alter its course to starboard so that each shall pass on the port side of the other.
Rule 15, crossing situation. USV has the option to either stand on or give way to the TS. When two power-driven vessels are crossing so as to involve risk of collision, the vessel, which has the other on its own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.
Rule 16, action by give-way vessel. Every vessel, which is directed by these rules to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.
Rule 17, action by stand-on vessel. (i) where one of two vessels is to keep out of the way the other shall keep her course and speed. (ii) the latter vessel may however take action to avoid collision by its maneuver alone, as soon as it becomes apparent to it that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules. When, from any cause, the vessel required to keep her course and speed finds itself so close that collision cannot be avoided by the action of the give-way vessel alone, it shall take such action as will best aid to avoid collision.
Rule 13, overtaking. The COLREGs state that “any vessel overtaking any other shall keep out of the way of the vessel being overtaken”. This stipulates that it is the responsibility of the overtaking ship to avoid a collision, but there is no clarity as to what action that ship should take to avoid a collision. Therefore, in the overtaking situation, we do not define a specific reward function but directly use the reward functions in the base layer to evaluate the avoidance actions of the USV.
Most recent research has focused on addressing more complex scenarios, such as areas with restricted visibility that have obstructions and busy narrow channels governed by traffic separation schemes. These scenarios involve interactions with vessels that may not comply with COLREGs. For these scenarios, Rules 2(b), 8, and 17 should be considered [9].
Rule 2(b), responsibility: under special circumstances, a departure from the rules may be made to avoid immediate danger.
Rule 8, actions to avoid collision: actions shall be made in ample time. If there is sufficient sea-room, alteration of course alone may be most effective. Reduce speed, stop or reverse if necessary. Action by a ship is required if there is a risk of collision, and when the ship has right-of-way.
Rule 17, actions by stand-on vessel: where one of two vessels is to keep out of the way, the other shall keep her course and speed. The latter vessel may however take action to avoid collision by her maneuver alone, as soon as it becomes apparent to her that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules.
In Fig.2, the yellow area represents the situation where the ship is crossing from the port side of other ships, the blue area represents the starboard crossing situation, the red area represents the head-on situation, and the green area represents the overtaking situation in which the USV is being overtaken. Fig.2a–d illustrate the actions that each vessel should take according to the COLREGs.
Fig.2. Illustrations of the COLREGs
”
“Since encounters are dynamic situations, continuous monitoring is required. We assume that USVs remain vigilant of other vessels. If another vessel (including TSs or other USVs) does not comply with COLREGs, the USV should take collision-avoidance actions in time”.
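For illustration, the sketch below shows how the encounter classes of Fig. 2 could be derived from the relative bearing of a contact and the heading difference between the two vessels. The angular thresholds, the coordinate convention (x east, y north, heading clockwise from north), and the function names are illustrative assumptions rather than the manuscript's exact implementation; only the 22.5° abaft-the-beam limit comes directly from Rule 13.

```python
import math

# Hypothetical angular thresholds in degrees (illustrative only).
HEAD_ON_HALF_SECTOR = 6.0        # near-reciprocal courses (Rule 14)
OVERTAKING_ABAFT_BEAM = 112.5    # more than 22.5 deg abaft the beam (Rule 13)

def relative_bearing(own_pos, own_heading_deg, ts_pos):
    """Bearing of the TS relative to the USV's bow, wrapped to [-180, 180)."""
    dx, dy = ts_pos[0] - own_pos[0], ts_pos[1] - own_pos[1]
    bearing = math.degrees(math.atan2(dx, dy)) - own_heading_deg
    return (bearing + 180.0) % 360.0 - 180.0

def classify_encounter(rel_bearing_deg, heading_diff_deg):
    """Map a TS contact to one of the encounter classes of Fig. 2."""
    hd = (heading_diff_deg + 180.0) % 360.0 - 180.0
    if abs(rel_bearing_deg) > OVERTAKING_ABAFT_BEAM:
        return "overtaking: USV being overtaken (Rule 13)"
    if abs(abs(hd) - 180.0) <= HEAD_ON_HALF_SECTOR and abs(rel_bearing_deg) <= HEAD_ON_HALF_SECTOR:
        return "head-on (Rule 14): alter course to starboard"
    if 0.0 < rel_bearing_deg <= OVERTAKING_ABAFT_BEAM:
        return "crossing, TS on starboard side (Rules 15/16): give way"
    return "crossing, TS on port side (Rule 17): stand on and monitor"
```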
- We have re-conducted collision avoidance experiments based on COLREGs to test the effectiveness of our proposed method in avoiding TSs in accordance with the COLREGs rules in Section 5.4 “COLREGs based Collision Avoidance Experiments”.
- We have clearly outlined the TSs avoiding scenarios in Fig. 14 (original Fig. 11). Here are the specific details:
“Fig. 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both USVs and obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach aggregation targets, considering both low-cost paths and collision avoidance among USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Fig. 14a and b illustrate two representative scenarios of avoiding collisions in accordance with COLREGs. USVs typically prioritize the most significant collision risks in the nearby area and subsequent collision avoidance issues, taking actions according to COLREGs to avoid collision. In Fig. 14a, the first USV not only overtakes the second USV by turning to the starboard side in accordance with COLREGs, but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side when heading on a TS.
In the scenario illustrated in Fig. 14b, the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Fig. 14 demonstrates that our path planning method is capable of effectively planning multiple USV paths to achieve targets while ensuring collision avoidance among USVs and TSs in accordance with COLREGs regulations.
Fig.14. Simulations of USVs avoiding TSs in accordance with COLREGs
”.
Comment 4.
Please specify the objective of the application of the TS Motion Detection Network, described in section 4. What is the reason to apply this in the path planning problem of USVs?
Response. Thank you for your comment; it is very helpful in improving the validity of the paper.
Given that motion is a critical aspect of USV observation, we endeavored to detect the TS motion intention (including the course and bearing) using vision cameras. Consequently, for the comprehensiveness of the experiment, we introduced a TS motion detection network to inform the path planning of USVs.
In response to your valuable comment, we have revised the relevant Section 3.6 “COLREGs-compliant Action Evaluation Network Design” and added a figure to describe the purpose of incorporating a TS motion detection network. The revised section and accompanying diagram are as follows.
“
Fig.7. COLREGs-compliant action selection
Fig. 7 illustrates that the Gym environment supplies the state vectors of the entities to the path planning model. Additionally, the TS motion detection network provides TS motion information, including courses and bearings, to the planner. Subsequently, the state vector and the observation vector are concatenated to form the input vector for both the actor network and the COLREGs-compliant action evaluation network.
By integrating the prediction vectors from both the actor and action evaluation networks, the policy network is capable of making well-informed decisions concerning the USV’s path-finding and collision-avoidance actions while following the COLREGs”.
Comment 5.
Please specify what COLREGs rules are fulfilled by the solution presented in Fig. 11 (similar question to question no. 3).
Response. Thank you for your valuable comment.
- Based on that, we have re-conducted collision avoidance experiments to test the effectiveness of our proposed method in avoiding TSs in accordance with the COLREGs rules in Section 5.4 “COLREGs based Collision Avoidance Experiments”.
- We have also clearly outlined the TSs avoiding scenarios in Fig. 14 (original Fig. 11). Here are the specific details:
Comment 6.
Please add a diagram showing the relation between the different components of the proposed approach, as listed in the Conclusion section: the cooperation module, the COLREGs based action choosing module and the TS detection module.
Response. Thank you for your valuable comment.
- Based on your comment, we have added Fig. 7 to our paper to show the relationship between the different components of our proposed approach, as listed in the Conclusion section: the cooperation module, the COLREGs-based action choosing module, and the TS detection module.
- Additionally, we have revised Figure 6 to clearly depict the data flow and illustrate the relationship between the different components of our proposed approach. The details are listed below.
“
Fig.6. Illustration of the Policy Network
Fig.7. COLREGs-compliant action selection
”
- We have supplemented the description of Fig.7 to illustrate the relationship between the various components of our proposed approach.
“The COLREGs-compliant action evaluation network consists of three Fully Connected (FC) layers, with parameter sizes of 18x64, 64x64, and 64x7 for the respective layers. The Tanh activation function is applied after the first two FC layers, and the softmax function is used following the last FC layer to obtain the COLREGs-compliant probabilities of actions. During the pre-training procedure, the CrossEntropyLoss function is utilized. The remaining parameters, including the optimizer and learning rate, are set to be identical to those of the policy network. The pre-trained network is utilized directly without applying any parameter updates via the policy gradient algorithm.
The probability vector of COLREGs-compliant actions is multiplied element-wise with the probability vector of candidate actions, and the action with the highest fused score is selected as the final action.
Fig. 7 illustrates that the Gym environment supplies the state vectors of the entities to the path planning model. Additionally, the TS motion detection network provides TS motion information, including courses and bearings, to the planner. Subsequently, the state vector and the observation vector are concatenated to form the input vector for both the actor network and the COLREGs-compliant action evaluation network.
By integrating the prediction vectors from both the actor and action evaluation networks, the policy network is capable of making well-informed decisions concerning the USV’s path-finding and collision-avoidance actions while following the COLREGs”.
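To make the fusion step concrete, the following minimal PyTorch sketch mirrors the layer sizes quoted above (18x64, 64x64 and 64x7 fully connected layers with Tanh activations and a softmax output) and interprets the described combination as an element-wise product of the two probability vectors followed by an argmax. The class and function names are ours, introduced only for illustration.

```python
import torch
import torch.nn as nn

class ColregsActionEvaluator(nn.Module):
    """Three FC layers (18->64->64->7) with Tanh between the first two and a
    softmax on the output, matching the sizes quoted in the revised Section 3.6."""
    def __init__(self, in_dim=18, hidden=64, n_actions=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        # COLREGs-compliant probability for each of the 7 candidate actions
        return torch.softmax(self.net(x), dim=-1)

def select_action(actor_probs, colregs_probs):
    """Fuse the actor's action distribution with the COLREGs-compliant
    probabilities by element-wise multiplication and pick the best action."""
    fused = actor_probs * colregs_probs
    return torch.argmax(fused, dim=-1)
```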
Comment 7.
Please point out the main advantages of the proposed approach and the limitations in the Conclusion section (preferably bullet points)
Response. Thank you for your valuable comment. We greatly appreciate your comment as it has helped us enhance the quality of the Conclusion section.
In accordance with your suggestion, we have revised the Conclusion section to highlight the main advantages of our proposed approach in bullet points. Additionally, we have acknowledged the limitation of this work, specifying that we have not yet translated the algorithm into physical USVs. The revised Conclusion section is as follows.
“This research proposes a two-stage path planning method for multiple USVs based on COLREGs. The method combines a cooperation module, a COLREGs-compliant action evaluation module, and a TS detection module. The cooperative path planning model for collision avoidance among USVs is constructed based on the MAPPO strategy, which utilizes a policy network capable of handling multiple aggregation goals, obstacles, and TSs. To achieve this, we define the action space, observation space, and reward function for the policy network, and design actor and critic networks.
- Monte Carlo experimental results confirm the effectiveness and efficiency of our path planning method for formation aggregation and collision avoidance. We conducted these experiments by randomly specifying the positions of USVs and obstacles. This approach allowed us to evaluate the performance of our method in diverse scenarios and validate its robustness.
- We benchmarked the simulation results against the MADDPG and MATD3 methods to validate the efficiency and the optimization performance of our approach.
After training the COLREGs-compliant action evaluation module using TS-avoiding trajectories, USV actions that violate COLREGs can be recognized and suppressed. This judgment is then used as a heuristic for the actor network. Our reward function considers both COLREGs and seamanship principles.
- We conducted further experiments to test the feasibility of our collision avoidance scheme based on COLREGs within the USV fleet, as well as between USVs and TSs. We were able to confirm the practicality and effectiveness of our method in a realistic scenario.
The TS detection network is constructed based on the YOLO network and the squeeze-and-excitation scheme.
- Our proposed TS detection model performs well in our specific environment.
Primarily, we evaluate the algorithm through simulations conducted in Gym environments. Additionally, we have conducted experiments on a semi-physical simulation platform where our algorithm acts as a local path planner, guiding the navigation of the system. The cooperative path planning module is executed on individual ship-borne computers (PC-104) installed on each USV member. VxWorks 6.6 is utilized as the operating system, with Workbench 3.0 as the chosen development package.
In our future experiments, we aim to further advance our research by translating the algorithm into physical USVs. This will allow us to conduct practical implementation and comprehensive testing of the algorithm's capabilities”.
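The revised Conclusion above mentions that the TS detection network combines the YOLO network with a squeeze-and-excitation scheme. For readers unfamiliar with the latter, a generic squeeze-and-excitation block is sketched below; it is a textbook version of the mechanism with a commonly used reduction ratio, not the exact module used in the manuscript.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel-attention block: global average pooling ("squeeze") followed by
    a two-layer bottleneck that produces per-channel weights ("excitation").
    The reduction ratio of 16 is a typical default, assumed here."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # reweight feature maps channel-wise
```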
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
corles is wrong.
what is the meaning of corleg in abstract?
contributions compared to to the state of the arts in introduction are not clear.
this paper is lack of literature review on collision avoidance path planner without reinforcement learning(RL).
comparison with state of the art is missing in simulation section.
RL is not suitable for handling unknown environments which are not trained before.
contribution of using RL is not clear.
simulation environment is too simple.
circle obstacles are too simple.
can you randomly generate obstacles while changing the number of ships?
Comments on the Quality of English Language
ok
Author Response
Thank you very much for the careful reading of our manuscript and valuable suggestions. Please find the detailed responses with figures and tables below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files. Based on the comments, we have made modifications on the revised manuscript. Detailed revisions were shown as follows.
Reviewer 4
Thank you very much for the careful reading of our manuscript and valuable suggestions. Based on the comments, we have made modifications on the revised manuscript. Detailed revisions were shown as follows.
The changes to our manuscript within the document were also highlighted by using red text.
Comments 1 and 2.
1. Corles is wrong. 2. What is the meaning of corleg in abstract?
Response. Thank you for your careful reading of our manuscript. We apologize for any spelling errors that may have occurred, and we greatly appreciate your attention to detail. The words "Corles" and "corleg" are misspellings of the term "COLREGs", which is the abbreviation for the International Regulations for Preventing Collisions at Sea. We have corrected the abbreviation and thoroughly proofread the entire document to ensure that no similar spelling errors remain.
In fact, we have fulfilled COLREGs rules 13, 14, 15, 16, and 17. To address more complex scenarios, we have also considered rules 2(b), 8, and 17.
- According to your comment, we have added descriptions of the COLREGs rules that our method mainly considers in Section 2.3 “COLREGs Rules for Collision Avoidance” to improve the readability of the paper. The descriptions are as follows:
“A USV may encounter situations of head-on, crossing and overtaking, as shown in Fig.2.
Referring to the studies [7], [8] and [9], the applicable COLREGs rules for this research are as follows:
Rule 14, head-on. When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision each shall alter its course to starboard so that each shall pass on the port side of the other.
Rule 15, crossing situation. USV has the option to either stand on or give way to the TS. When two power-driven vessels are crossing so as to involve risk of collision, the vessel, which has the other on its own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.
Rule 16, action by give-way vessel. Every vessel, which is directed by these rules to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.
Rule 17, action by stand-on vessel. (i) where one of two vessels is to keep out of the way the other shall keep her course and speed. (ii) the latter vessel may however take action to avoid collision by its maneuver alone, as soon as it becomes apparent to it that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules. When, from any cause, the vessel required to keep her course and speed finds itself so close that collision cannot be avoided by the action of the give-way vessel alone, it shall take such action as will best aid to avoid collision.
Rule 13, overtaking. The COLREGs state that “any vessel overtaking any other shall keep out of the way of the vessel being overtaken”. This stipulates that it is the responsibility of the overtaking ship to avoid a collision, but there is no clarity as to what action that ship should take to avoid a collision. Therefore, in the overtaking situation, we do not define a specific reward function but directly use the reward functions in the base layer to evaluate the avoidance actions of the USV.
Most recent research has focused on addressing more complex scenarios, such as areas with restricted visibility that have obstructions and busy narrow channels governed by traffic separation schemes. These scenarios involve interactions with vessels that may not comply with COLREGs. For these scenarios, Rules 2(b), 8, and 17 should be considered [9].
Rule 2(b), responsibility: under special circumstances, a departure from the rules may be made to avoid immediate danger.
Rule 8, actions to avoid collision: actions shall be made in ample time. If there is sufficient sea-room, alteration of course alone may be most effective. Reduce speed, stop or reverse if necessary. Action by a ship is required if there is a risk of collision, and when the ship has right-of-way.
Rule 17, actions by stand-on vessel: where one of two vessels is to keep out of the way, the other shall keep her course and speed. The latter vessel may however take action to avoid collision by her maneuver alone, as soon as it becomes apparent to her that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules.
In Fig.2, the yellow area represents the situation where the ship is crossing from the port side of other ships, the blue area represents the starboard crossing situation, the red area represents the head-on situation, and the green area represents the overtaking situation in which the USV is being overtaken. Fig.2a–d illustrate the actions that each vessel should take according to the COLREGs.
Fig.2. Illustrations of the COLREGs
”
“Since encounters are dynamic situations, continuous monitoring is required. We assume that USVs remain vigilant of other vessels. If another vessel (including TSs or other USVs) does not comply with COLREGs, the USV should take collision-avoidance actions in time”.
- We have re-conducted collision avoidance experiments based on COLREGs to test the effectiveness of our proposed method in avoiding TSs in accordance with the COLREGs rules in Section 5.4 “COLREGs based Collision Avoidance Experiments”.
- We have clearly outlined the TSs avoiding scenarios in Fig. 14 (original Fig. 11). Here are the specific details:
“Fig. 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both USVs and obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach aggregation targets, considering both low-cost paths and collision avoidance among USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Fig. 14a and b illustrate two representative scenarios of avoiding collisions in accordance with COLREGs. USVs typically prioritize the most significant collision risks in the nearby area and subsequent collision avoidance issues, taking actions according to COLREGs to avoid collision. In Fig. 14a, the first USV not only overtakes the second USV by turning to the starboard side in accordance with COLREGs, but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side when heading on a TS.
In the scenario illustrated in Fig. 14b, the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Fig. 14 demonstrates that our path planning method is capable of effectively planning multiple USV paths to achieve targets while ensuring collision avoidance among USVs and TSs in accordance with COLREGs regulations.
Fig.14. Simulations of USVs avoiding TSs in accordance with COLREGs
”.
Comment 3.
contributions compared to the state of the arts in introduction are not clear.
Response. Thank you very much for your constructive comment; it is pivotal in enhancing the literature review and effectively presenting the novelty of the work.
- According to the comment, we have carefully revised the Introduction section and clearly presented the contribution of the work by comparing to the state-of-the-art methods.
We have also referenced the latest contributions in the field of COLREGs-based path planning to enhance our literature review. Details are as follows.
The newly added references as listed below.
- Maza, J. A. G.; Argüelles, R. P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Engineering, 2022, 261, 112029.
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
- Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A review on COLREGs-compliant navigation of autonomous surface vehicles: From traditional to learning-based approaches. Journal of Automation and Intelligence, 2022, 1(1), 100003.
- Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Networks: The Official Journal of the International Neural Network Society, 2022,152.
- Li, L.; Wu, D.; Huang, Y.; Yuan, Z. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Applied Ocean Research, 2021, 113, 102759.
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, PMLR, 2018, 1587-1596.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 2022, 35, 24611-24624.
- Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. Journal of Marine Science and Engineering, 2022, 10(5), 585.
- Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access, 2020, 8, 165344-165364.
- Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. Journal of Marine Science and Technology, 2021, 26, 509-524.
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Engineering, 2020, 217, 107704.
- Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-Compliant Collision Avoidance Decision Approach Based on Deep Reinforcement Learning. Journal of Marine Science and Engineering, 2022, 10(7), 944.
- Wei, G.; Kuo, W. COLREGs-compliant multi-ship collision avoidance based on multi-agent reinforcement learning technique. Journal of Marine Science and Engineering, 2022, 10(10), 1431.
- We have added the reviews of the advancements in interpreting and implementing the COLREGs rules, as follows.
“
The development of navigation compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) has progressed along two dimensions: the complexity of ship encounter scenarios and the evolution of methodologies.
The studies [7], [8], and [9] thoroughly analyzed and interpreted various scenarios in which Unmanned Surface Vehicles (USVs) may encounter other vessels during their navigation, as well as the COLREGs rules that all vessels must follow to avoid collision. The research [7] provides valuable insights into ship collision avoidance based on COLREG 72, which can be useful for Officers Of the Navigational Watch (OONW) both onboard and remotely, as well as for autonomous systems. The research is supported by examples drawn from various works that, despite their significant influence in the current literature, may not have a correct standpoint. The study [8] contributes to an understanding of COLREGs sailing rules based on the insights of navigators and researchers. In the study [9], the recent progress in COLREGs-compliant navigation of USVs, from traditional to learning-based approaches, is reviewed in depth.
There are mainly four classes of conventional methods that take great effort to accommodate COLREGs rules in path planning modules [9].
Rule-based methods involve hand-crafted design and focus on simple ship-encounter scenarios, and hence are hard to extend to more complex ship encounter scenarios. Hybrid methods, such as A* variants and Rapidly-exploring Random Tree variants, suffer from the complexity of the multi-TS avoidance problem and the high dimensionality of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also difficult to apply in complex TS encounter scenarios.
Additionally, traditional research has usually focused on simple one-to-one ship encounter scenarios dictated by Rules 13–16 in Part B of COLREGs. However, the USV path planning problem becomes more challenging when considering multi-TS encounter scenarios. When those algorithms are applied in simulated or real scenarios, their simplified and basic assumptions about COLREGs fall far short of its real complexity [7, 8].
Therefore, traditional methods are not able to fully exploit the seamanship of experienced mariners in solving complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions
”.
- We have incorporated the following sentences to offer the reader a vivid comprehension and impression of the deep reinforcement learning method.
“
Different from traditional methods, as a paradigm in the field of Machine Learning (ML) to achieve multi-agent collaboration, the Deep Reinforcement Learning (DRL) model mainly studies the synchronous learning and evolution of agent strategy that is used to plan the coordinated movement of formation in real time [10]. By interacting with the environment, it continuously optimizes the agent's action strategy. The value function of different action strategies in the current state is estimated, and high-return actions are executed to avoid performing low-return or punitive actions. The deep network module of DRL is utilized to fit the motion model of USVs, enabling smooth control and avoiding falling into the local optimal solution [10]. DRL is well-suited for situations where the optimal decision-making strategy is not known beforehand and for dealing with non-stationary environments where the underlying dynamics may change over time [10]. These characteristics make it an effective and powerful tool for solving our task.
DRL methods can be divided into three categories [11]. The Value-based methods, such as DQN-variants, estimate the optimal values of all different states, then derive the optimal policy using the estimated values [12, 13]. The policy-based methods, such as Trust Region Policy Optimization (TRPO) [14] and Proximal Policy Optimization (PPO) [15], optimize the policy directly without maintaining the value functions. The actor–critic methods, such as Deep Deterministic Policy Gradient (DDPG) [16], Twin Delayed DDPG (TD3) [17] and Soft Actor-Critic (SAC) [18], can be viewed as a combination of the above two methods, and they maintain an explicit representation of both the policy (the actor) and the value estimates (the critic)
”.
- We have revised the description of the fundamental DRL methods.
“
A vanilla policy gradient method, namely TRPO, ensures that each update to the policy parameters stays within a trust region to prevent large parameter updates that might degrade policy performance. It can handle high-dimensional state spaces [14]. However, it is relatively complicated, is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks), and has poor data efficiency. To address these problems, the Proximal Policy Optimization (PPO) method emerged. PPO converts the constraint term of TRPO into a penalty term to decrease the complexity of the constrained optimization problem, and it uses only first-order optimization [15]. Multi-Agent PPO (MAPPO) is a variant for the multi-agent partially observable Markov decision process [19]. The Multi-Agent DDPG (MADDPG) method is another state-of-the-art method besides MAPPO. In MADDPG, each agent treats the other agents as part of the environment, and agents in the same region cooperate with each other to determine the optimal coordinated action [16]. However, it is hard to achieve stability due to the complexity of its hyperparameters
”.
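For readers unfamiliar with how PPO replaces TRPO's trust-region constraint with a first-order scheme, a minimal sketch of the standard clipped surrogate objective (applied per agent in MAPPO) is given below. The clipping coefficient of 0.2 is a commonly used default and is assumed here, not necessarily the value adopted in the manuscript.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO: the probability ratio is kept inside
    [1 - eps, 1 + eps], which replaces TRPO's explicit trust-region constraint
    with a simple first-order penalty."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # maximize the surrogate => minimize its negative
    return -torch.min(unclipped, clipped).mean()
```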
- We have added some COLREGs-based collision avoidance methods via the deep reinforcement learning model to improve the literature review.
“
In the paper [20], a multi-USV automatic collision avoidance method was employed based on a Double Deep Q Network (DDQN) with prioritized experience replay. In the study [21], the PPO algorithm and a hand-crafted reward function are used to encourage the USV to comply with the COLREGs rules.
Sawada et al. [22] proposed a collision avoidance method based on PPO; it uses a grid sensor to quantize obstacle zones by target and uses a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to control the rudder angle. Xu et al. [23] proposed a COLREGs Intelligent Collision Avoidance (CICA) algorithm, which tracks the current network weights to update the target network weights, improving the stability of learning the optimal strategy.
The study [24] employed a collision avoidance framework that divides all encounter scenarios into seven types according to the avoidance constraints of the COLREGs for different encountered scenes.
In the study [25], the COLREGs and ship maneuverability were considered in the reward for achieving multi-ship automatic collision avoidance, and the Optimal Reciprocal Collision Avoidance (ORCA) algorithm was used to detect and reduce the risk of collision
”.
- We have included a table to demonstrate the advantages and limitations of classic path re-planning methods and DRL methods.
“
Tbl.1. Introduction to comparative methods
| Category | Method | Advantage | Limitation |
| --- | --- | --- | --- |
| Path re-planning methods | Rule-based methods | COLREGs rules integrated into path re-planning | Relies on hand-crafted design; hard to extend to complex ship encounter scenarios |
| | Hybrid methods: A* | Fast; COLREGs compliance incorporated into path re-planning | Hard to extend to more complex ship encounter scenarios |
| | Reactive methods: velocity obstacle | Fast; COLREGs compliance enforced by integrating forbidden zones | Accurate TS course information required |
| | Optimization-based methods | Optimal; COLREGs rules naturally formulated as constraints | Relatively high computational burden |
| DRL methods | Value-based methods: Q-learning, DQN | Optimal policy derived from estimates of the optimal values of all states | Overestimation bias; high dimensionality; hard to balance exploration and exploitation |
| | Policy-based methods: MAPPO, TRPO | Directly optimize the policy without maintaining value functions | Careful design of reward functions required |
| | Actor–critic methods: MADDPG, MATD3 | Explicit representation of both the policy and the value estimates | Computationally expensive; hyperparameter sensitivity |
”
- To provide the motivations for our study, we have analyzed the existing unresolved difficult issues or problems in the field of COLREGs-based path planning as follows.
“
The majority of current DRL methods focuses on generating COLREGs-compliant paths using a COLREG-based reward function. However, the high randomness in early-stage action selection can lead to unpredictable strategy gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring multiple USVs can maintain formation and reach their respective goals [30, 31]. Additionally, being COLREGs-compliant does not guarantee an ideal evasive behavior as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectory data and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure USV navigation adheres to COLREGs rules but also replicates the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments poses a complex problem due to the need to plan multiple paths simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner is efficient. MAPPO offers high efficiency in learning, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task
”.
- Subsequently, we explicitly outlined the novelties of our research as follows:
“
Subsequently, this research seeks to propose a two-stage Multi-Agent Reinforcement Learning (MARL) scheme based on the MAPPO algorithm, incorporating a centralized training and decentralized execution strategy. The following innovations are presented:
- We introduce a COLREGs-compliant action evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and the candidate action vector from the actor network, we select the action that is most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
- We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
- A TS motion detection network is constructed to provide guidance for the decision-making process of the MARL model.
”.
Comment 4.
this paper is lack of literature review on collision avoidance path planner without reinforcement learning(RL).
Response. Thank you very much for your helpful comment; it is important in enhancing the literature review and effectively presenting the novelty of the work.
According to the comment, we have added a literature review of collision avoidance path planners that do not use reinforcement learning (RL), as follows:
“There are mainly four classes of conventional methods that take great effort to accommodate COLREGs rules in path planning modules [9].
Rule-based methods involve hand-crafted design and focus on simple ship-encounter scenarios, and hence are hard to extend to more complex ship encounter scenarios. Hybrid methods, such as A* variants and Rapidly-exploring Random Tree variants, suffer from the complexity of the multi-TS avoidance problem and the high dimensionality of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also difficult to apply in complex TS encounter scenarios.
Additionally, traditional research has usually focused on simple one-to-one ship encounter scenarios dictated by Rules 13–16 in Part B of COLREGs. However, the USV path planning problem becomes more challenging when considering multi-TS encounter scenarios. When those algorithms are applied in simulated or real scenarios, their simplified and basic assumptions about COLREGs fall far short of its real complexity [7, 8].
Therefore, traditional methods are not able to fully exploit the seamanship of experienced mariners in solving complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions”.
“
Tbl.1. Introduction to comparative methods
| Category | Method | Advantage | Limitation |
| --- | --- | --- | --- |
| Path re-planning methods | Rule-based methods | COLREGs rules integrated into path re-planning | Relies on hand-crafted design; hard to extend to complex ship encounter scenarios |
| | Hybrid methods: A* | Fast; COLREGs compliance incorporated into path re-planning | Hard to extend to more complex ship encounter scenarios |
| | Reactive methods: velocity obstacle | Fast; COLREGs compliance enforced by integrating forbidden zones | Accurate TS course information required |
| | Optimization-based methods | Optimal; COLREGs rules naturally formulated as constraints | Relatively high computational burden |
| DRL methods | Value-based methods: Q-learning, DQN | Optimal policy derived from estimates of the optimal values of all states | Overestimation bias; high dimensionality; hard to balance exploration and exploitation |
| | Policy-based methods: MAPPO, TRPO | Directly optimize the policy without maintaining value functions | Careful design of reward functions required |
| | Actor–critic methods: MADDPG, MATD3 | Explicit representation of both the policy and the value estimates | Computationally expensive; hyperparameter sensitivity |
The comparative methods are outlined in Tbl.1.
”
Comment 5.
comparison with state of the art is missing in simulation section.
Response. Thank you for your comment; it is highly valuable for improving the rigor of the paper.
- Based on the comment, we conducted a comparative analysis of the MAPPO algorithm and the MADDPG method.
- Additionally, we performed an experiment comparing MAPPO with the MATD3 method, an improved variant of MADDPG.
- The average path length and average steering angles on the paths in the Monte Carlo simulations were also provided.
- We rewrote the Section 5.3 titled “Path Planning Results” as follows.
“We trained our planner for a total of 1e+6 and 6e+6 iterations, respectively. We evaluated the network and recorded metric values every 5000 training iterations; the evaluation reward curves are shown in Fig.11. The action space dimension is 7 and the observation space dimension is 24. Figures 12 and 13 depict the training reward curves of the MADDPG and Multi-Agent TD3 (MATD3) algorithms. MATD3 improves upon the original DDPG algorithm by incorporating additional techniques, such as twin critic networks and delayed policy updates, to improve coordination and learning in complex environments. Due to the complexity of the training process and the long duration required for each iteration, we trained both the MADDPG and MATD3 algorithms for 1e+6 iterations.
We observed that the reward curve of our method converges as the training iterations approach 1e+6. The oscillation is relatively low, indicating that the method has relatively high stability. Fig.11b shows the training reward curve for 6e+6 training iterations. The rewards decrease as the training iterations increase, indicating the method scales well with a large number of training iterations.
The curve became steady as the number of iterations approached 5e+6, which probably means the model was sufficiently trained. Since training efficiency is also an important metric for the online planner model, we benchmarked our method against the state-of-the-art MADDPG and MATD3 models after training for 1e+6 iterations. The oscillations in the reward curves of MADDPG and MATD3 are significant, and the decreases in rewards are not obvious. This observation possibly implies that MADDPG and MATD3 may require more training iterations than our method.
Meanwhile, the duration of training for our MAPPO-based method with 6e+6 iterations was less than 5 hours, whereas the training durations for MADDPG and MATD3 with only 1e+6 iterations exceeded 10 hours. The above observations demonstrate that our method has much higher training efficiency than MADDPG and MATD3 due to the low complexity of MAPPO.
The results of the Monte Carlo simulations show that our MAPPO algorithm outperforms both the MADDPG and MATD3 algorithms in terms of average path length and steering angle. Specifically, the average path length of our MAPPO algorithm is 0.489, which is lower than that of the MADDPG and MATD3 algorithms (0.683 and 0.582, respectively). This indicates that our algorithm is able to plan more optimal paths. Moreover, the average steering angle of our MAPPO algorithm is 2.14 rad, which is lower than that of the MADDPG and MATD3 algorithms (2.35 and 2.23 rad, respectively). This suggests that our algorithm is also more efficient in steering the vessel. Overall, these observations provide evidence that our method is capable of planning more optimal paths than the MADDPG and MATD3 models after being trained for 1e+6 iterations in the Monte Carlo simulation environment. This result highlights the potential of our algorithm to improve the performance of autonomous navigation systems.
Fig.11. Reward curves of MAPPO-based planner. (a) reward curve of training for 1e+6 iterations; (b) reward curve of training for 6e+6 iterations.
Fig.12. Reward curve of MADDPG
Fig.13. Reward curve of MATD3
”
Comment 6.
RL is not suitable for handling unknown environments which are not trained before.
Response. Thank you for your insightful comment. We acknowledge the challenge of generalization in reinforcement learning (RL), especially when encountering unknown, untrained environments.
To tackle this issue, we aim to leverage knowledge acquired from related tasks or environments. Our approach involves the use of a deep network for evaluating COLREGs-compliant actions, guiding the RL training process through seamanship principles. We have developed a dataset to train the network, enabling it to effectively utilize seamanship knowledge and enhance the training process in new environments.
- In response to this feedback, we have revised Section 3.6 "COLREGs-compliant Action Evaluation Network Design" to provide further clarity on our approach. The newly added Fig. 7 and relevant descriptions explicitly outline the role of the COLREGs-compliant action evaluation network in action selection. For more detailed information, please refer to the following section.
“We used a pre-training strategy, training the COLREGs-compliant action evaluation network separately from the other parts of the policy network. The COLREGs-compliant action evaluation network is pre-trained using pre-recorded USV trajectory data, which includes the state vector and global observation vector, as well as the USV actions for avoiding the TSs. The action labels align with our action space. The final established dataset consists of 5000 data items, with an almost equal number of avoidance behaviors for each category (head-on, crossing and give-way, crossing and stand-on, and overtaking), ensuring a balanced dataset.
Fig.7. COLREGs-compliant action selection
The COLREGs-compliant action evaluation network consists of three Fully Connected (FC) layers, with parameter sizes of 18x64, 64x64, and 64x7 for the respective layers. The Tanh activation function is applied after the first two FC layers, and the softmax function is used following the last FC layer to obtain the COLREGs-compliant probabilities of actions. During the pre-training procedure, the CrossEntropyLoss function is utilized. The remaining parameters, including the optimizer and learning rate, are set to be identical to those of the policy network. The pre-trained network is utilized directly without applying any parameter updates via the policy gradient algorithm.
The probability vector of COLREGs-compliant actions is multiplied element-wise with the probability vector of candidate actions, and the action with the highest fused score is selected as the final action.
Fig. 7 illustrates that the Gym environment supplies the state vectors of the entities to the path planning model. Additionally, the TS motion detection network provides TS motion information, including courses and bearings, to the planner. Subsequently, the state vector and the observation vector are concatenated to form the input vector for both the actor network and the COLREGs-compliant action evaluation network.
By integrating the prediction vectors from both the actor and action evaluation networks, the policy network is capable of making well-informed decisions concerning the USV’s path-finding and collision-avoidance actions while following the COLREGs”.
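A minimal sketch of the described supervised pre-training stage is given below, assuming the trajectory dataset has already been converted to an (N, 18) float tensor of states and an (N,) long tensor of action labels over the 7-way action space. The batch size, learning rate, and epoch count are illustrative placeholders rather than the manuscript's settings, which are stated to follow the policy network.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def pretrain_evaluator(states, action_labels, epochs=50, lr=3e-4, batch_size=64):
    """Supervised pre-training of the COLREGs-compliant action evaluation
    network on pre-recorded TS-avoidance trajectories (about 5000 items,
    balanced over head-on, crossing give-way, crossing stand-on, overtaking)."""
    evaluator = nn.Sequential(                 # 18x64, 64x64, 64x7 FC layers
        nn.Linear(18, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 7),
    )
    loader = DataLoader(TensorDataset(states, action_labels),
                        batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()          # expects raw logits; softmax is applied at inference
    optimizer = optim.Adam(evaluator.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(evaluator(x), y)
            loss.backward()
            optimizer.step()
    return evaluator
```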
- Based on your comment, we conducted additional experiments to compare the performance of our proposed method with two traditional deep reinforcement learning methods: MADDPG and MATD3. The results of this comparison demonstrate that our proposed method exhibits significantly higher efficiency. For more detailed information, please refer to the response to Comment 5.
Comment 7.
contribution of using RL is not clear.
Response. Thank you for your helpful comment, and we apologize for not clearly describing the contribution of using RL in our previous manuscript. The significant contribution of utilizing the RL method lies in solving the complex COLREGs-compliant path planning problem in multi-ship encounter scenarios. We chose the RL method for our task for the following reasons.
Traditional methods that do not rely on machine learning often struggle with these challenges due to their complexity. Previous research has mainly focused on simple one-to-one ship encounters dictated by Rules 13–16 in Part B of COLREGs, but the USV path planning problem becomes much more challenging when encountering multiple TSs.
The RL approaches are effective at handling complex decision-making problems in real-world environments, where traditional approaches may not be feasible or effective. The method can also handle situations where the optimal decision-making strategy is not known beforehand. By gradually adjusting its behavior based on this feedback, the agent can converge towards an optimal decision-making strategy over time. Furthermore, RL is capable of adapting to non-stationary environments, where the underlying dynamics of the environment may change over time. RL algorithms can adjust to these changes and continue to make effective decisions, making it a powerful tool for solving complex decision-making problems across a variety of domains.
- Subsequently, we have incorporated descriptions of traditional methods in the Introduction section of our revised manuscript.
The revised parts of the Introduction section are presented below:
“There are mainly four classes of conventional methods that take great effort to accommodate COLREGs rules in path planning modules [9].
Rule-based methods involve hand-crafted design and focus on simple ship-encounter scenarios, and hence are hard to extend to more complex ship encounter scenarios. Hybrid methods, such as A* variants and Rapidly-exploring Random Tree variants, suffer from the complexity of the multi-TS avoidance problem and the high dimensionality of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also difficult to apply in complex TS encounter scenarios.
Additionally, traditional research has usually focused on simple one-to-one ship encounter scenarios dictated by Rules 13–16 in Part B of COLREGs. However, the USV path planning problem becomes more challenging when considering multi-TS encounter scenarios. When those algorithms are applied in simulated or real scenarios, their simplified and basic assumptions about COLREGs fall far short of its real complexity [7, 8].
Therefore, traditional methods are not able to fully exploit the seamanship of experienced mariners in solving complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions”.
- Furthermore, we have provided an explanation for selecting RL as our fundamental path planning method.
“Different from traditional methods, as a paradigm in the field of Machine Learning (ML) to achieve multi-agent collaboration, the Deep Reinforcement Learning (DRL) model mainly studies the synchronous learning and evolution of agent strategy that is used to plan the coordinated movement of formation in real time [10]. By interacting with the environment, it continuously optimizes the agent's action strategy. The value function of different action strategies in the current state is estimated, and high-return actions are executed to avoid performing low-return or punitive actions. The deep network module of DRL is utilized to fit the motion model of USVs, enabling smooth control and avoiding falling into the local optimal solution [10]. DRL is well-suited for situations where the optimal decision-making strategy is not known beforehand and for dealing with non-stationary environments where the underlying dynamics may change over time [10]. These characteristics make it an effective and powerful tool for solving our task”.
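As a small illustration of the value estimates mentioned above, the sketch below computes the discounted return of an action sequence, which is the quantity a DRL critic learns to approximate when ranking high-return against low-return actions. The discount factor of 0.99 is an assumed typical value, not a setting from the manuscript.

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    computed backwards over a recorded reward sequence."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```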
- We have included a table to demonstrate the advantages and limitations of classic path re-planning methods and DRL methods.
“
Tbl.1. Introduction to comparative methods
| Category | Method | Advantage | Limitation |
| --- | --- | --- | --- |
| Path re-planning methods | Rule-based methods | COLREGs rules integrated into path re-planning | Relies on hand-crafted design; hard to extend to complex ship encounter scenarios |
| | Hybrid methods: A* | Fast; COLREGs compliance incorporated into path re-planning | Hard to extend to more complex ship encounter scenarios |
| | Reactive methods: velocity obstacle | Fast; COLREGs compliance enforced by integrating forbidden zones | Accurate TS course information required |
| | Optimization-based methods | Optimal; COLREGs rules naturally formulated as constraints | Relatively high computational burden |
| DRL methods | Value-based methods: Q-learning, DQN | Optimal policy derived from estimates of the optimal values of all states | Overestimation bias; high dimensionality; hard to balance exploration and exploitation |
| | Policy-based methods: MAPPO, TRPO | Directly optimize the policy without maintaining value functions | Careful design of reward functions required |
| | Actor–critic methods: MADDPG, MATD3 | Explicit representation of both the policy and the value estimates | Computationally expensive; hyperparameter sensitivity |
”
- We have analyzed the unresolved issues in the field of COLREGs-based path planning. Then the significant contributions of RL to our complex decision task are presented.
“The majority of current DRL methods focus on generating COLREGs-compliant paths using a COLREGs-based reward function. However, the high randomness of early-stage action selection can lead to unpredictable policy-gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring that multiple USVs can maintain formation and reach their respective goals [30, 31]. Additionally, being COLREGs-compliant does not guarantee ideal evasive behavior, as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectory data and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure USV navigation adheres to the COLREGs rules but also replicate the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments is a complex problem, as multiple paths must be planned simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner be efficient. MAPPO offers high learning efficiency, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task”.
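To illustrate why a PPO-style planner converges quickly and stably, the snippet below sketches the clipped surrogate loss that PPO, and MAPPO on a per-agent basis, minimizes. This is a hedged sketch assuming a PyTorch implementation; the tensor names and clip value are illustrative and not taken from the paper.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO/MAPPO-style policy updates.

    Clipping the probability ratio keeps every policy update small, which is
    the main source of the fast, stable convergence attributed to MAPPO above.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # negate to minimize
```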
- Based on the provided comment, we conducted a comparative experiment of the MAPPO algorithm, the MADDPG method, and the MATD3 method, which is the latest variant of MADDPG.
In the Monte Carlo simulations, we measured the average path length and the average steering angle along the paths. Based on these findings, we selected MAPPO as our fundamental deep reinforcement learning method. For more information, please refer to the response to Comment 5.
Comment 8.
simulation environment is too simple.
Response. Thank you for your insightful comment. It is helpful in enhancing the credibility of our experiments. We fully agree with your point of view that our original simulation environment is too simple.
In response to your comment, we have re-conducted collision avoidance experiments in Section 5.4 titled “COLREGs based Collision Avoidance Experiments”, incorporating different shapes of obstacles and multiple vessel encountering scenarios.
The revised environment includes multiple aggregating targets, varied vessel encounter situations, and obstacles of different shapes, with randomized positions for USVs, TSs, and aggregation targets. These modifications aim to test the effectiveness of our proposed method in adhering to the COLREGs rules when addressing complex encountering scenarios involving Target Ships (TSs). The updated Section 5.4 and the details of the new experiment are presented below.
“Fig. 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both USVs and obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach aggregation targets, considering both low-cost paths and collision avoidance among USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Fig. 14a and b illustrate two representative scenarios of avoiding collisions in accordance with COLREGs. USVs typically prioritize the most significant collision risks in the nearby area before handling subsequent collision avoidance issues, taking actions according to COLREGs to avoid collision. In Fig. 14a, the first USV not only overtakes the second USV by turning to its starboard side in accordance with COLREGs, but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side in a head-on encounter with a TS.
In the scenario illustrated in Fig. 14b, the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Fig. 14 demonstrates that our path planning method is capable of effectively planning multiple USV paths to achieve targets while ensuring collision avoidance among USVs and TSs in accordance with COLREGs regulations.
Fig. 14. Simulations of USVs avoiding TSs in accordance with COLREGs: (a) and (b) show two representative scenarios
”
We conducted a simple Monte Carlo simulation, as shown in Figure 10 (previously Figure 8), with randomly positioned USVs, obstacles, and aggregation targets to test the convergence of our proposed method and the comparative methods. We compared the reward curves of our method with those of the comparative methods, namely MADDPG and MATD3. Additionally, we collected and compared the average path lengths and steering angles of the planned paths to evaluate the optimization capability of these methods.
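For reference, the two metrics collected in the Monte Carlo runs can be computed from a planned path roughly as sketched below, assuming a path is given as a list of (x, y) waypoints; this is an illustrative helper, not the paper's evaluation code.

```python
import math

def path_metrics(waypoints):
    """Return (total path length, average absolute steering angle in radians)
    for a path given as a list of (x, y) waypoints."""
    length, turn_sum, turns = 0.0, 0.0, 0
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(waypoints[:-1], waypoints[1:]):
        length += math.hypot(x1 - x0, y1 - y0)
        heading = math.atan2(y1 - y0, x1 - x0)
        if prev_heading is not None:
            # wrap the heading change into [-pi, pi) before accumulating
            delta = (heading - prev_heading + math.pi) % (2 * math.pi) - math.pi
            turn_sum += abs(delta)
            turns += 1
        prev_heading = heading
    avg_steering = turn_sum / turns if turns else 0.0
    return length, avg_steering
```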
Comment 9.
circle obstacles are too simple.
Response. Thank you for your insightful comment; it helps to enhance the credibility of our experiments. We fully agree with your statement that in practical marine environments, obstacles often have irregular shapes that cannot be simply represented by circles.
- Based on the comment, we have re-conducted collision avoidance experiments in Section 5.4 titled “COLREGs based Collision Avoidance Experiments”, incorporating different shapes of obstacles and multiple vessel encountering scenarios.
The revised environment includes multiple aggregating targets, varied vessel encounter situations, and obstacles of different shapes, with randomized positions for USVs, TSs, and aggregation targets.
These modifications allow us to test the effectiveness of our proposed method in adhering to the COLREGs rules when addressing complex encountering scenarios involving Target Ships (TSs). For more information, please refer to the response to Comment 8.
- Obstacles in practical marine environments are often irregularly shaped. In our COLREGs-compliant path planning problem, we addressed this issue by computing collision situations between USVs and obstacles based on the minimum distance from the USV to the face of the obstacle. We achieved this by casting consecutive rays from the center of the USV and calculating the distance along each ray to the closest boundary of the obstacle. The shortest distance obtained in this manner was used to assess the possibility of collision. This approach is similar to the working principle of radar and can be implemented in practical scenarios (a minimal sketch of this distance computation is given after this list).
To handle irregularly shaped obstacles, we decompose them into smaller parts, each of which can be represented by a regular bounding shape such as a circle, rectangle, triangle, or oval. By employing the aforementioned method, we can evaluate collision situations during path planning while taking irregularly shaped obstacles into account.
- Then, we have developed a reinforcement learning (RL)-based method specifically designed to address the challenging dynamic scenarios of collision avoidance within a fleet of Unmanned Surface Vehicles (USVs), as well as between USVs and target ships or obstacles.
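Below is a minimal sketch of the radar-like minimum-distance check described in the list above, restricted to circular bounding shapes for brevity; the ray count, sensing range, and collision threshold are assumed values, and rectangular, triangular, or oval bounding shapes would need their own boundary-distance routines.

```python
import math

def min_obstacle_distance(usv_xy, circle_obstacles, num_rays=36, max_range=100.0):
    """Cast rays from the USV centre and return the shortest distance to any
    circular obstacle boundary, mimicking a radar-like sweep.

    circle_obstacles: list of (cx, cy, radius).
    """
    ux, uy = usv_xy
    best = max_range
    for k in range(num_rays):
        theta = 2.0 * math.pi * k / num_rays
        dx, dy = math.cos(theta), math.sin(theta)
        for cx, cy, r in circle_obstacles:
            # project the obstacle centre onto the ray to find the closest approach
            t = (cx - ux) * dx + (cy - uy) * dy
            if t < 0.0:
                continue                      # obstacle lies behind this ray
            closest = math.hypot(ux + t * dx - cx, uy + t * dy - cy)
            if closest <= r:
                hit = t - math.sqrt(r * r - closest * closest)  # distance to boundary
                best = min(best, max(hit, 0.0))
    return best

# Usage: flag a potential collision when the swept distance drops below a threshold
collision_risk = min_obstacle_distance((0.0, 0.0), [(5.0, 1.0, 2.0)]) < 3.0
```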
Comment 10.
can you randomly generate obstacles while changing the number of ships?
Response. Thank you for your insightful feedback.
To enhance the variability of our simulation, we implemented a random configuration for the shapes and quantities of obstacles and target ships each time the simulation was reset. When defining the observation space of USVs, we have imposed limitations on the maximum number of target ships (up to three) and the maximum number of obstacles (up to four). The model can be extended to accommodate additional entities. The positions of obstacles, USVs, and target ships were also randomly set.
Specifically, Figure 14a depicts a scenario involving two TSs, while Figure 14b illustrates a scenario with three TSs.
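A randomized reset of the kind described above might look like the following sketch; the entity caps match the limits stated in this response (up to three TSs and four obstacles), while the arena size, shape set, and data layout are illustrative assumptions.

```python
import random

SHAPES = ("circle", "rectangle", "triangle")   # assumed shape set
ARENA = 100.0                                  # assumed square arena side length

def reset_environment(max_ts=3, max_obstacles=4, n_usvs=3):
    """Randomize entity counts, shapes, and positions on every episode reset."""
    def random_xy():
        return (random.uniform(0.0, ARENA), random.uniform(0.0, ARENA))

    return {
        "usvs": [random_xy() for _ in range(n_usvs)],
        "targets": [random_xy() for _ in range(n_usvs)],   # one aggregation target per USV
        "target_ships": [
            {"pos": random_xy(), "heading": random.uniform(0.0, 360.0)}
            for _ in range(random.randint(1, max_ts))
        ],
        "obstacles": [
            {"pos": random_xy(), "shape": random.choice(SHAPES)}
            for _ in range(random.randint(1, max_obstacles))
        ],
    }
```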
- Taking your feedback into account, we have carried out supplementary collision avoidance experiments in accordance with the COLREGs rules, as detailed in Section 5.4 titled "COLREGs based Collision Avoidance Experiments." These experiments serve to demonstrate the efficacy of our proposed method in evading varying numbers of TSs based on real-time environmental information. We have included these avoidance scenarios in Figure 14 (formerly Figure 11). For more details, please refer to the response to comment 8.
- In Section 3.1, titled "Observation Space Design," we stated that the observation vector can comprise up to three USVs, four obstacles, and three TSs in the vicinity of a USV. The relevant description is as follows.
“The variables Δv_x and Δv_y represent the difference between the velocities of USV_i and USV_j. Specifically, this equation considers three USVs, four obstacles, and three TSs in the vicinity of a USV”.
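As a rough illustration of how such a fixed-size observation can be assembled (three USVs, four obstacles, and three TSs per agent, zero-padded when fewer entities are present), consider the sketch below; the field names, ordering, and padding scheme are assumptions rather than the paper's exact design.

```python
import numpy as np

def build_observation(ego, other_usvs, obstacles, target_ships,
                      max_usvs=3, max_obs=4, max_ts=3):
    """Flatten relative positions and velocity differences (delta_vx, delta_vy)
    of nearby entities into a fixed-length vector, zero-padded to the caps."""
    def pad(entries, cap, width):
        rows = entries[:cap] + [[0.0] * width] * (cap - len(entries[:cap]))
        return np.asarray(rows, dtype=np.float32).ravel()

    usv_part = pad([[u["x"] - ego["x"], u["y"] - ego["y"],
                     u["vx"] - ego["vx"], u["vy"] - ego["vy"]] for u in other_usvs],
                   max_usvs, 4)
    obstacle_part = pad([[o["x"] - ego["x"], o["y"] - ego["y"]] for o in obstacles],
                        max_obs, 2)
    ts_part = pad([[t["x"] - ego["x"], t["y"] - ego["y"],
                    t["vx"] - ego["vx"], t["vy"] - ego["vy"]] for t in target_ships],
                  max_ts, 4)
    return np.concatenate([usv_part, obstacle_part, ts_part])
```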
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Ok
Author Response
We appreciate the positive comments from the reviewer. Thank you very much for your encouraging feedback on our manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
Many parts have been appropriately modified. However, a comprehensive review of COLREGs papers is lacking. Please refer to the papers below to reinforce the need for an introduction.
1. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm
2. Determining the proper times and sufficient actions for the collision avoidance of navigator-centered ships in the open sea using artificial neural networks
3. Modeling evasive action to be implemented at the minimum distance for collision avoidance in a give-way situation
4. Safety and COLREG evaluation for marine collision avoidance algorithms.
Author Response
Thank you very much for the careful reading of our manuscript and valuable suggestions. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files. Based on the comments, we have made modifications on the revised manuscript. Detailed revisions are shown as follows. The changes to our manuscript within the document are also highlighted by using red text.
Comment
Many parts have been appropriately modified. However, a comprehensive review of COLREGs papers is lacking. Please refer to the papers below to reinforce the need for an introduction.
- Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm
- Determining the proper times and sufficient actions for the collision avoidance of navigator-centered ships in the open sea using artificial neural networks
- Modeling evasive action to be implemented at the minimum distance for collision avoidance in a give-way situation
- Safety and COLREG evaluation for marine collision avoidance algorithms.
Response. Thank you for your valuable comment; it is very important for elevating the professionalism and quality of our paper.
Based on your comment, we have enhanced the review of COLREGs papers by carefully considering the recommended studies. We also note that the study "Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm" by Kim and Park was already cited in our previous manuscript. Therefore, in the revised version, we have included the other three studies, further strengthening the Introduction.
We have revised the relevant section to highlight the advancement of research on ship collision avoidance through the presentation of systematic approaches and the provision of insights into the interpretation of COLREGs rules.
Additionally, we have thoroughly checked the references throughout the entire paper to ensure accuracy.
The recommended literature that has already been incorporated is listed below.
“
- Kim, J. K.; Park, D. J. Understanding of sailing rule based on COLREGs: Comparison of navigator survey and automated collision-avoidance algorithm. Marine Policy, 2024, 159, 105894.
”
The newly added references are as follows:
“
- Yim, J. B.; Park, D. J. Modeling evasive action to be implemented at the minimum distance for collision avoidance in a give-way situation. Ocean Engineering, 2022, 263, 112210.
- Kim, J. K.; Park, D. J. Determining the Proper Times and Sufficient Actions for the Collision Avoidance of Navigator-Centered Ships in the Open Sea Using Artificial Neural Networks. Journal of Marine Science and Engineering, 2023, 11(7), 1384.
- Hagen, I. B.; Vassbotn, O.; Skogvold, M.; Johansen, T. A.; Brekke, E. F. Safety and COLREG evaluation for marine collision avoidance algorithms. Ocean Engineering, 2023, 288, 115991.
”
The corresponding parts that have been rewritten in the Introduction section are as follows:
“
The studies [7-12] contribute to the advancement of research on ship collision avoidance by presenting systematic approaches and providing insights into the interpretation of the COLREGs rules. Research [7] offers valuable insights into ship collision avoidance based on COLREG 72, which can be useful for Officers Of the Navigational Watch (OONW) both onboard and remotely, as well as for autonomous systems. The research is supported by examples drawn from various works that, despite their significant influence in the current literature, may not adopt a correct standpoint. In study [8], Kim and Park provide insights into COLREGs sailing rules based on the perspectives of navigators and researchers. In study [9], the recent progress in COLREGs-compliant navigation of USVs, from traditional to learning-based approaches, is reviewed in depth.
In study [10], Yim and Park presented a systematic approach to modeling the evasive action required to prevent collisions in a give-way situation at the minimum-distance moment. The researchers established a conceptual framework for such evasive action and identified COLREGs-compliant maneuvers through a simulation based on ship-handling scenarios. In research [11], Kim and Park proposed a method for determining the appropriate timing and necessary actions to ensure ship collision avoidance in accordance with the COLREGs rules, using Bayesian Regularized Artificial Neural Networks (BRANNs). In paper [12], Hagen et al. expressed the COLREGs rules mathematically, providing insights into their interpretation through the selection of parameters and weights
”.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
I am satisfied with the revision
Comments on the Quality of English Language
Ok
Author Response
We appreciate the positive comments from the reviewer. Thank you very much for your encouraging feedback on our manuscript. Based on your comment, we have carefully polished the entire manuscript, hoping that our revisions meet the requirements for publication. Please find the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.