1. Introduction
The key to the unmanned surface vehicle (USV) cooperative path-planning problem is the adaptive selection of the best collision-avoidance action [1]. In this procedure, the following sub-problems are predominantly considered: cooperatively avoiding static obstacles, and avoiding collisions within the USV fleet and between USVs and target ships (TSs) [2]. Several challenges are posed to the planning process. High-dimensional action and state spaces present a major difficulty for the efficiency of an online planner that must respond to various scenarios in time [3,4]. Meanwhile, dynamic TS avoidance is complicated because it can greatly increase the dimension of the planning space [5,6].
Research on navigation compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) has developed along two dimensions: the complexity of ship-encounter scenarios and the evolution of methodologies.
Studies [7,8,9,10,11,12] have advanced research on ship collision avoidance by presenting systematic approaches and providing insights into the interpretation of the COLREGs rules. The research described in [7] offers valuable insights into ship collision avoidance based on COLREG 72, which can be useful for Officers of the Navigational Watch (OONW), both onboard and remote, as well as for autonomous systems. However, that research is supported by examples drawn from various works that, despite their significant influence on the current literature, may not adopt the correct standpoint. In [8], Kim and Park provide insights into the COLREGs sailing rules from the perspectives of navigators and researchers. In [9], the recent progress in COLREGs-compliant navigation of USVs, from traditional to learning-based approaches, is reviewed in depth.
In [10], Yim and Park presented a systematic approach to modeling the evasive action aimed at preventing collisions in a give-way situation at the minimum-distance moment. The researchers established a conceptual framework for such evasive action and identified COLREGs-compliant maneuvers through a simulation based on ship-handling scenarios. In [11], Kim and Park proposed a method for determining the appropriate timing and the actions necessary to ensure ship collision avoidance in accordance with the COLREGs rules, using Bayesian-regularized artificial neural networks (BRANNs). In [12], Hagen et al. expressed the COLREGs rules mathematically, providing insights into their interpretation through the selection of parameters and weights.
There are four main classes of conventional methods that strive to accommodate the COLREGs rules in their path-planning modules [9]. Rule-based methods involve hand-crafted designs, focus on simple ship-encounter scenarios, and hence cannot easily be extended to more complex ones. Hybrid methods, such as A* variants and rapidly exploring random tree variants, suffer from the complexity of the multi-TS-avoidance problem and the high dimension of the multi-USV planning space. Reactive methods, such as the artificial potential field and velocity obstacle methods, have difficulties with uncertain TS course prediction. Optimization-based methods are also hard to apply in complex TS-encounter scenarios.
Additionally, traditional research usually focuses on simple one-on-one ship-encounter scenarios dictated by Rules 13–16 in Part B of the COLREGs. However, USV path planning becomes more challenging when multi-TS encounter scenarios are considered. When running those algorithms in simulated or real scenarios, simplified and basic assumptions about the COLREGs are far from their real complexity [7,8].
Therefore, traditional methods cannot fully exploit the seamanship of experienced mariners to resolve complex situations, and they can hardly be considered powerful nonlinear approximators of the optimal value and policy functions.
Unlike traditional methods, the deep reinforcement learning (DRL) model, a machine learning (ML) paradigm for achieving multi-agent collaboration, mainly studies the synchronous learning and evolution of agent strategies used to plan the coordinated movement of a formation in real time [13]. By interacting with the environment, it continuously optimizes the agents' action strategies: the value of different action strategies in the current state is estimated, and high-return actions are executed while low-return or penalized actions are avoided. The deep network module of DRL is used to fit the motion model of the USVs, enabling smooth control and avoiding local optima [13]. DRL is well suited to situations where the optimal decision-making strategy is not known beforehand and to non-stationary environments whose underlying dynamics may change over time [13]. These characteristics make it an effective and powerful tool for our task.
DRL methods can be divided into three categories [14]. Value-based methods, such as deep Q-network variants, estimate the optimal values of all states and then derive the optimal policy from the estimated values [15,16]. Policy-based methods, such as trust-region policy optimization (TRPO) [17] and proximal policy optimization (PPO) [18], optimize the policy directly without maintaining value functions. Actor–critic methods, such as the deep deterministic policy gradient (DDPG) [19], twin-delayed DDPG (TD3) [20], and soft actor–critic (SAC) [21], can be viewed as a combination of the two and maintain an explicit representation of both the policy (the actor) and the value estimates (the critic).
The vanilla policy gradient (VPG) method directly optimizes the policy using the gradient of the expected reward, which can lead to slow convergence [17]. In response, TRPO keeps each update to the policy parameters within a trust region to prevent large parameter updates that could degrade the policy's performance [17]. TRPO can also handle high-dimensional state spaces effectively. However, it is relatively complicated, incompatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks), and data-inefficient. The PPO method emerged to address these problems: it converts the constraint term of TRPO into a penalty term to reduce the complexity of the constrained optimization problem, and it uses only first-order optimization [18]. Multi-agent PPO (MAPPO) is its extension to the multi-agent partially observable Markov decision process [22]. The multi-agent DDPG (MADDPG) method is another state-of-the-art approach besides MAPPO. In MADDPG, each agent treats the other agents as part of the environment, and agents in the same region cooperate to determine the optimal coordinated action [19]. However, stability is hard to achieve because of the complexity of its hyperparameters.
In [23], a multi-USV automatic collision-avoidance method based on a double deep Q-network (DDQN) with prioritized experience replay was employed. In [24], the PPO algorithm and a hand-crafted reward function are used to encourage the USV to comply with the COLREGs rules.
Sawada et al. [25] proposed a collision-avoidance method based on PPO; it uses a grid sensor to quantize obstacle zones by target, and a convolutional neural network (CNN) together with a long short-term memory (LSTM) network to control the rudder angle. Xu et al. [26] proposed a COLREGs intelligent collision-avoidance (CICA) algorithm that tracks the current network weights to update the target network weights, which improves stability when learning the optimal strategy.
The study in [27] employed a collision-avoidance framework that divides all encounter scenarios into seven types according to the avoidance constraints of the COLREGs for different encounter scenes.
In [28], the COLREGs and ship maneuverability were considered in the reward to achieve multi-ship automatic collision avoidance, and the optimal reciprocal collision-avoidance (ORCA) algorithm was used to detect and reduce the risk of collision.
In the work in [6], the action space and reward function were improved by incorporating the reciprocal velocity obstacle (RVO) scheme, and gated recurrent unit-based networks were used to directly map the state of a varying number of surrounding obstacles to the corresponding actions.
In [5], the collaborative path-planning problem was modelled as a decentralized partially observable Markov decision process, and an observation model, a reward function, and an action space suitable for the MAPPO algorithm were devised for multi-target search tasks. In the research described in [29], a system was constructed to switch between path-following and collision-avoidance modes in real time, with the collision hazard perceived through encounter identification and risk calculation.
In [30], Q-learning was applied to optimize the state-action pairs and obtain an initial strategy, which was then fine-tuned with PPO. In [31], agents were trained using a mixture of observations from different training environments, and linearity constraints were imposed on both the observation interpolations and the supervision (e.g., associated reward) interpolations. In the study described in [32], multiple objectives were optimized simultaneously to enhance PPO.
The comparative methods are outlined in Table 1. The majority of current DRL methods focus on generating COLREGs-compliant paths using a COLREGs-based reward function. However, the high degree of randomness in early-stage action selection can lead to unpredictable policy-gradient updates, making it challenging to achieve model convergence.
Moreover, in the case of USVs, it is crucial for their paths to be both feasible and optimal, while ensuring that multiple USVs can maintain their formation and reach their respective goals [33,34]. Additionally, being COLREGs-compliant does not guarantee ideal evasive behavior, as it may result in overly conservative or inattentive responses to unexpected TSs.
By leveraging a large dataset of pre-recorded USV trajectories and employing powerful DRL-based methods, it is possible to derive promising solutions that not only ensure that USV navigation adheres to the COLREGs rules but also replicate the good seamanship exhibited by experienced mariners. Furthermore, path planning in dynamic environments is a complex problem because multiple paths must be planned simultaneously while ensuring collision avoidance within the USV fleet and between USVs and TSs. Therefore, it is crucial that the planner be efficient. MAPPO offers high learning efficiency, fast convergence, and improved stability, making it an ideal choice as the basic path planner for our task.
Accordingly, this research proposes a two-stage multi-agent reinforcement learning (MARL) scheme based on the MAPPO algorithm, incorporating centralized training and decentralized execution. The following innovations are presented:
We introduce a COLREGs-compliant action-evaluation module to compute action probabilities that align with COLREGs regulations when encountering multiple TSs. The module parameters are learned from a dataset of pre-recorded USV trajectories. By fusing the probability vector and the candidate action vector from the actor network, we select an action that is the most feasible for the encountered situation. Our reward function incorporates both COLREGs and seamanship considerations, providing a dual heuristic approach to guide the selection of COLREGs-compliant actions.
We propose a policy network that can handle multiple aggregation goals, obstacles, and dynamic TSs. To achieve this, we have defined the action space, observation space, and reward function for the policy network. Additionally, we have designed actor and critic networks.
A TS motion-detection network is constructed to provide guidance for the decision-making process of the MARL model.
3. MAPPO Algorithm Design
3.1. Observation Space Design
Each USV can observe environmental information, including information on the TSs within its sensing range, and the USVs share their observations through communication to construct the global observation [5]. The observation consists of the positions and navigation states of both USVs and TSs, as well as environmental information such as the positions and shapes of obstacles. The observation vector has a fixed dimension, and any missing values are filled with zeros.
The inner formation observational features are employed to describe the state of the USVs within the communication range. For , the observational feature of the neighboring is set to = {, , , , , , , , , }, where is the location of the estimated aggregation target that is available to all USVs, and and represent the velocity components of in the and directions, respectively. To avoid complex coupling between velocities, orthogonal components are used to express the actions of the USV. In this context, and represent the position of the first vessel (denoted as ), while and represent the position of another vessel (denoted as ). The variable represents the TSs and obstacle features observed by . The variables and represent the differences between the velocities of and , respectively. Specifically, this equation considers three USVs, four obstacles, and three TSs in the vicinity of a USV.
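As a minimal sketch of this fixed-dimension observation design (the feature sizes, entity ordering, and function names below are illustrative assumptions, not the paper's implementation), the vector can be assembled by zero-padding the slots of entities that are not currently observed:

```python
import numpy as np

# Hypothetical per-entity feature sizes; the paper fixes 3 neighboring USVs,
# 4 obstacles, and 3 TSs per observation but does not publish the exact layout.
USV_FEAT, OBS_FEAT, TS_FEAT = 4, 3, 4
MAX_USVS, MAX_OBSTACLES, MAX_TSS = 3, 4, 3

def build_observation(usv_feats, obstacle_feats, ts_feats):
    """Concatenate per-entity features into a fixed-length vector,
    zero-padding the slots of entities that are out of sensing range."""
    def pad(entities, max_count, feat_dim):
        block = np.zeros((max_count, feat_dim), dtype=np.float32)
        for i, e in enumerate(entities[:max_count]):
            block[i, :] = e
        return block.ravel()

    return np.concatenate([
        pad(usv_feats, MAX_USVS, USV_FEAT),
        pad(obstacle_feats, MAX_OBSTACLES, OBS_FEAT),
        pad(ts_feats, MAX_TSS, TS_FEAT),
    ])

# Example: only one neighboring USV and one TS are currently observed.
obs = build_observation(
    usv_feats=[np.array([10.0, 5.0, 1.2, -0.4])],
    obstacle_feats=[],
    ts_feats=[np.array([120.0, 80.0, -2.0, 0.0])],
)
print(obs.shape)  # fixed dimension regardless of how many entities are visible
```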
3.2. USV-State Feature Design
The USV-state feature is represented as = {, , , , , , , , , }, where represents the name of the USV state, indicates whether the USV has reached an aggregation goal, represents the collision state of the USV, represents the optional actions determined by the current USV state, and and represent the USV position.
To address the restricted visibility problem in narrow channels, practical communication is simulated by incorporating observations from USVs with added noises into a global observation during the model training process. The USV can take communication actions to broadcast information to other USVs, and represents the communication utterance transferred from to , while represents the communication noise. Noise is added to the communication utterance of a USV to simulate the practical environment.
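A hedged sketch of how such noisy communication could be simulated is shown below; the additive Gaussian noise model and all names are assumptions, since the text does not specify the noise distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def broadcast_with_noise(utterance, noise_std=0.05):
    """Simulate a noisy channel: the receiving USV sees the broadcast
    utterance perturbed by additive noise (distribution assumed)."""
    noise = rng.normal(0.0, noise_std, size=utterance.shape)
    return utterance + noise

def merge_global_observation(local_obs, received_utterances):
    """Concatenate a USV's own observation with the (noisy) broadcasts
    received from the other fleet members."""
    return np.concatenate([local_obs] + list(received_utterances))

own = np.array([0.3, -0.1, 1.0])
heard = [broadcast_with_noise(np.array([0.8, 0.2, -0.5]))]
global_obs = merge_global_observation(own, heard)
```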
3.3. Reward Function Design
The reward calculation combines guidance rewards and sparse rewards. During task execution, each USV receives a guiding reward at each time step, which aims to guide the USVs to chase the aggregation targets and to form a formation. The average reward of all USVs is calculated as the collective reward of the USV formation. The reward function consists of guiding rewards, COLREGs-compliant rewards, collision rewards, and time-consumption rewards, weighted accordingly.
To guide the USVs to approach the aggregation targets during the early stage of training, the USVs receive negative rewards based on their distances to the targets at each time step, until the USVs reach the target.
where the coefficient limits the range of reward changes; if the distance of a USV to its nearest target is less than the threshold, the USV is considered to have reached the aggregation point. We set the coefficient to 1 and the threshold to 20, and all the hyperparameters are set empirically.
The guiding reward estimates how close the USVs are to the targets, whereas the time-consumption reward reflects the actual time consumed from the start points: a USV receives a negative reward of −0.1 for each additional step it takes. Controlling turns is more difficult than following a straight path; thus, a USV receives a negative reward when it performs a steering maneuver. We assign a reward of −0.6 to turning points where the trajectory turning angle exceeds 45 degrees.
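The guiding, time-consumption, and turning terms described above can be combined as in the following sketch; the exact functional form of the guiding reward is an assumption (only the coefficient 1, the threshold 20, the −0.1 step penalty, and the −0.6 turning penalty are given in the text):

```python
K_GUIDE = 1.0        # coefficient limiting the range of reward changes
D_REACH = 20.0       # distance threshold for "reached the aggregation point"
R_STEP = -0.1        # penalty per additional time step
R_TURN = -0.6        # penalty for turning more than 45 degrees

def guiding_reward(dist_to_target):
    """Negative reward proportional to the distance to the nearest
    aggregation target; zero once the target is considered reached.
    (The exact functional form is assumed here.)"""
    if dist_to_target < D_REACH:
        return 0.0
    return -K_GUIDE * dist_to_target

def step_rewards(dist_to_target, turn_angle_deg):
    """Sum of the guiding, time-consumption, and turning terms for one step."""
    r = guiding_reward(dist_to_target) + R_STEP
    if abs(turn_angle_deg) > 45.0:
        r += R_TURN
    return r

print(step_rewards(dist_to_target=150.0, turn_angle_deg=60.0))
```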
The obstacle-avoidance reward consists of two parts, which are the static obstacle-avoidance reward and the dynamic obstacle-avoidance reward. The static obstacle-avoidance reward is calculated using the following formula.
where the first variable is the minimum distance between a USV and its obstacles, the weight is set to 1, and the reward amplitude depends on that distance: if the distance exceeds 600, the amplitude is 0; when it is between 300 and 600, the amplitude is 1.5; when it is between 100 and 300, the amplitude is 2; and when it is below 100, the amplitude is 10. The reward is a negative value whose magnitude grows exponentially as the distance between the USV and the obstacles decreases.
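A possible reading of this piecewise, exponentially shaped reward is sketched below; the amplitudes, the weight of 1, and the 600/300/100 distance bands come from the text, while the exponential form and the `scale` constant are assumptions:

```python
import math

def static_avoid_amplitude(d_min):
    """Reward amplitude as a piecewise function of the minimum distance
    between a USV and the static obstacles (values taken from the text)."""
    if d_min > 600:
        return 0.0
    if d_min >= 300:
        return 1.5
    if d_min >= 100:
        return 2.0
    return 10.0

def static_avoid_reward(d_min, weight=1.0, scale=600.0):
    """Negative reward whose magnitude grows exponentially as the USV
    approaches an obstacle; the exponential form and `scale` are assumed,
    only the amplitudes and the weight are reported."""
    amp = static_avoid_amplitude(d_min)
    return -weight * amp * math.exp(-d_min / scale)

for d in (800, 450, 200, 50):
    print(d, round(static_avoid_reward(d), 3))
```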
The dynamic avoidance reward concerns the USVs avoiding sailing obstructions (including other USVs and TSs) in accordance with the COLREGs. If a USV action enters the collision area of another USV or TS and the minimum distance from the USV to the other sailing obstructions is smaller than 600, a negative reward is added to the total reward. As shown in Formulas (7) and (8), this reward consists of two parts: the COLREGs-compliant reward, which is based on adherence to the rules, and the obstacle-avoidance reward, which focuses on avoiding vessels regardless of the rules. Once the encounter scenario is determined according to the COLREGs, the calculation of the COLREGs-compliant reward remains unchanged. During the dynamic obstacle-avoidance procedure, the USV continuously monitors other vessels; if a vessel that is responsible for giving way according to the COLREGs does not comply with the rules, the USV takes collision-avoidance actions based on their distance, regardless of the specific rules.
The dynamic obstacle-avoidance constant, Rrule, is computed as shown below, where the weight, α, is set to 0.5.
The head-on reward is shown in Formula (9). According to the COLREGs, the USV should turn to the starboard side in a head-on scenario; if the USV turns left, a penalty will be added to the reward. The rewards for the give-way and stand-on situations, shown in Formulas (10) and (11), are similar to the head-on reward. There is no guidance on the action the USV should take to avoid a collision in the overtaking situation. Therefore, we do not define a specific reward function for this scenario but, instead, directly use the distance to evaluate the avoidance actions of the USV. Finally, the head-on reward, give-way reward, and stand-on reward are combined to form the COLREGs-compliant reward, as shown in Formula (12).
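The following sketch illustrates one way such a rule-dependent penalty and the weighted combination (α = 0.5) could be expressed; the penalty magnitude, the exact per-rule conditions, and the form of the combination are assumptions rather than reproductions of the paper's Formulas (9)–(12):

```python
ALPHA = 0.5          # weight between rule compliance and pure avoidance
R_VIOLATION = -1.0   # penalty magnitude for rule-violating actions (assumed)

def colregs_reward(encounter, turn_direction):
    """Penalize actions that violate the COLREGs for the identified encounter.
    `encounter` in {"head_on", "give_way", "stand_on", "overtaking"};
    `turn_direction` in {"port", "starboard", "keep"}.
    Overtaking carries no rule-based term and is judged by distance only."""
    if encounter == "head_on" and turn_direction == "port":
        return R_VIOLATION      # should turn to starboard in a head-on scenario
    if encounter == "give_way" and turn_direction != "starboard":
        return R_VIOLATION      # assumed: give-way vessel should take starboard action
    if encounter == "stand_on" and turn_direction != "keep":
        return R_VIOLATION      # assumed: stand-on vessel should keep course and speed
    return 0.0

def dynamic_avoid_reward(encounter, turn_direction, distance_reward):
    """Weighted combination of the COLREGs-compliant reward and the
    distance-based avoidance reward (combination form assumed)."""
    return ALPHA * colregs_reward(encounter, turn_direction) + (1 - ALPHA) * distance_reward
```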
3.4. Action Space Design
To ensure planning efficiency, we use a discrete action space distributed uniformly at 45-degree intervals around a USV. As shown in Figure 5, the action probabilities are determined by the actor network; the appropriate actions (in green) are selected, while the actions in red are suppressed.
3.5. Policy Network Design
Figure 6 shows the policy network used in this research. The environment considers USVs sailing among static obstacles and TSs. An observation block is constructed for each USV to map the USV-state feature to the observation feature by interacting with the environment. The COLREGs-compliant action-evaluation module distinguishes the actions that satisfy the COLREGs from those that violate the rules. The MLP-based actor network is responsible for calculating the reward and for mapping the observation feature into the optimal action. The critic network, which assesses the chosen action, is also built on an MLP, and it outputs the corresponding value estimate.
To optimize the policies, the network alternates between sampling data with the current policy and performing optimization on the sampled data. The sampled data vectors are saved in the replay buffer, and we use the mini-batch method: after collecting data for multiple epochs, the data are transferred from the buffer to the policy-gradient calculation module for the optimization of the actor and critic networks. By sharing weights among the networks, the planner can learn from the cooperative action-selection experiences. When the policy network is used for path planning in a real environment, the actor networks plan actions based on the observation features alone, without the critic networks.
The loss function of the network adopts the clip strategy [18], with the clip parameter set to 0.1, as follows.
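For reference, a minimal PyTorch sketch of the clipped surrogate loss with a clip parameter of 0.1 is given below; the variable names and the toy data are assumptions:

```python
import torch

CLIP_EPS = 0.1

def ppo_clip_loss(log_probs_new, log_probs_old, advantages):
    """Clipped surrogate policy loss used by PPO/MAPPO (negated for
    gradient descent); advantages are assumed to be normalized."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Toy check with random data.
lp_new = torch.randn(32, requires_grad=True)
lp_old = lp_new.detach() + 0.05 * torch.randn(32)
adv = torch.randn(32)
loss = ppo_clip_loss(lp_new, lp_old, adv)
loss.backward()
```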
3.6. COLREGs-Compliant Action-Evaluation Network Design
The COLREGs-compliant action-evaluation module is built based on the MLP network. We used a pre-training strategy, training the COLREGs-compliant action-evaluation network separately from the other parts of the policy network. The COLREGs-compliant action-evaluation network is pre-trained using pre-recorded USV trajectory data, which includes the state vector and global observation vector, as well as the USV actions for avoiding the TSs. The action labels align with our action space. The final established dataset consists of 5000 data items, with an almost equal number of avoidance behaviors for each category (head-on, crossing and give-way, crossing and stand-on, and overtaking), ensuring a balanced dataset.
The COLREGs-compliant action-evaluation network consists of three fully connected (FC) layers, with parameter sizes of 18 × 64, 64 × 64, and 64 × 7 for the respective layers. The Tanh activation function is applied after the first two FC layers, and the softmax function is used following the last FC layer to obtain the COLREGs-compliant probabilities of actions. During the pre-training procedure, the CrossEntropyLoss function is utilized. The remaining parameters, including the optimizer and the learning rate, are set to be identical to those of the policy network. The pre-trained network is utilized directly without applying any parameter updates via the policy gradient algorithm.
The COLREGs-compliance probability vector is multiplied element-wise with the action-probability vector from the actor network, and the action with the highest resulting score is selected as the final action.
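A compact sketch of the evaluation network and this fusion step is given below (PyTorch); the element-wise-product reading of the fusion and the toy tensors are assumptions, while the 18 × 64, 64 × 64, 64 × 7 layer sizes, the Tanh/softmax activations, and the CrossEntropyLoss pre-training follow the description above:

```python
import torch
import torch.nn as nn

class ColregsActionEvaluator(nn.Module):
    """Three FC layers (18-64, 64-64, 64-7) with Tanh activations and a
    softmax output, matching the layer sizes reported above."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(18, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 7),
        )

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)

def fuse_and_select(actor_probs, colregs_probs):
    """Element-wise fusion of the actor's action probabilities with the
    COLREGs-compliance probabilities; the highest fused score is executed."""
    return torch.argmax(actor_probs * colregs_probs, dim=-1)

# Pre-training uses CrossEntropyLoss on recorded avoidance actions;
# the input/label tensors here are illustrative only.
evaluator = ColregsActionEvaluator()
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 18)
labels = torch.randint(0, 7, (8,))
loss = criterion(evaluator.net(x), labels)  # logits go to CrossEntropyLoss
```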
Figure 7 illustrates that the gym environment supplies the state vectors of the entities to the path-planning model. Additionally, the TS motion-detection network provides TS motion information, including courses and bearings, to the planner. Subsequently, the state vector and the observation vector are concatenated to form the input vector for both the actor network and the COLREGs-compliant action-evaluation network.
By integrating the prediction vectors from both the actor and action-evaluation networks, the policy network is capable of making well-informed decisions concerning the USV’s path-finding and collision-avoidance actions while following the COLREGs.
4. TS Motion-Detection Network Design
This research recognizes TS intentions from time series of RGB images. The bow, stern, and whole body of the TSs are detected in consecutive frames. TSs are detected at different scales, since they are sometimes relatively small in the overall picture. Subsequently, the motions of the TSs can be identified using the inter-frame difference scheme.
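A minimal sketch of the inter-frame difference idea applied to detection boxes is shown below; the box format, the time step, and the omission of image-to-world conversion are assumptions:

```python
import math

def box_center(box):
    """Center of an (x1, y1, x2, y2) detection box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def estimate_motion(prev_box, curr_box, dt=0.1):
    """Inter-frame difference on detection boxes: the displacement of the
    box center between consecutive frames gives an apparent course and
    speed in image coordinates (mapping to world coordinates is omitted)."""
    (x0, y0), (x1, y1) = box_center(prev_box), box_center(curr_box)
    dx, dy = x1 - x0, y1 - y0
    course = math.degrees(math.atan2(dx, -dy)) % 360.0   # image y grows downward
    speed = math.hypot(dx, dy) / dt
    return course, speed

print(estimate_motion((100, 200, 180, 260), (110, 195, 190, 255)))
```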
Practical TS images are relatively rare; thus, we adopt the pre-train-and-fine-tune paradigm to train the network. First, the network is pre-trained using the SeaShips dataset [36], which is designed for training and evaluating ship object-detection algorithms. The dataset consists of 31,455 images covering six common ship types (ore carriers, bulk carriers, general cargo ships, container ships, fishing vessels, and passenger ships). All images come from approximately 10,080 real-world video segments captured by surveillance cameras in a deployed coastline video-surveillance system, and they were carefully selected to cover all possible imaging variations, such as different proportions, hull components, lighting, viewpoints, backgrounds, and occlusions. After our TS detection model is pre-trained on the SeaShips dataset, it is fine-tuned on real-scene images to generalize it to our task.
The TS detection model is shown in Figure 8. The network is constructed based on the YOLO model [37]. It utilizes the spatial pyramid pooling-based feature (SPPF) module to capture the spatial features of images, and the cross-stage partial (CSP) network is employed to enhance the CNN block and reduce the computational burden [37]; CSPi_j refers to the i-th CSP module utilizing j residual components. The CBS module, comprising a convolutional layer, batch normalization, and SiLU activation, is used for feature extraction.
The channel attention mechanism, namely squeeze-and-excitation (SE), is combined with the YOLO network to boost the feature-extraction ability, as shown in Formula (14) and Figure 9.
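A standard squeeze-and-excitation block of the kind referred to here can be sketched as follows (PyTorch); the reduction ratio of 16 is the common default and an assumption, not a value reported in the paper:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: global average pooling
    ("squeeze"), a two-layer bottleneck with reduction ratio r
    ("excitation"), and channel-wise rescaling of the input features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

# Example: re-weighting a feature map from a backbone stage.
feat = torch.randn(2, 64, 40, 40)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 40, 40])
```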
5. Experiments
5.1. Experimental Setting
Our MAPPO-based path-planning program runs on a computer equipped with an Intel(R) Core(TM) i7-7820HQ CPU @ 2.90 GHz and 16 GB of memory. Our planning model is compared with the MADDPG model in the experiments. Because of the complexity of the MADDPG algorithm, we run it on a more powerful computer with the following configuration: Ubuntu 20.04.6 LTS as the operating system, an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20 GHz with 48 cores and 2 threads per core, and two 8192 MB memory modules.
5.2. The Path-Planning Model Training Pipeline
The MAPPO-based path-planning model is trained using a multi-threaded approach, where each thread establishes a virtual environment so that data points can be fully selected by importance. Observations are captured first; then, step actions for a USV are generated by the actor network and rewarded until the episode ends. The step rewards of the episode, i.e., of a path, are summed and then normalized; the normalization eliminates the impact of the different scales of the metrics. The maximum number of steps per episode is 25, which means that the planner generates a local path of no more than 25 steps in the local planning time domain. The critic network assesses the cooperative action-selection strategy of the actor network by solving the value function, and the obtained values are also normalized.
We adopt the mini-batch scheme, and the batch size is set to 32 episodes. After collecting a batch of data, the actor and critic networks are trained three times, and the average metrics are used to compute the policy gradient.
The learning rate is set to 5 × 10⁻⁴, the discount factor for calculating the rewards to 0.99, the policy entropy coefficient to 0.01, and the epsilon value of the Adam optimizer to 1 × 10⁻⁵. The simulation time step is 0.1 s. We use learning-rate decay and gradient-clipping tricks when updating the network parameters. The orthogonal parameter-initialization scheme is applied to the MLP layers of both the actor and critic networks to keep the weight matrices orthogonal, thereby alleviating the problems of vanishing and exploding gradients.
The MLP blocks of the actor network consist of three fully connected (FC) layers, with parameter sizes of 18 × 64, 64 × 64, and 64 × 7, respectively. The Tanh activation function is applied after the first two FC layers, and the softmax function follows the last FC layer to obtain the probabilities of choosing the actions. The MLP blocks of the critic network consist of three FC layers with parameter sizes of 54 × 64, 64 × 64, and 64 × 1, respectively; the Tanh activation function is applied after the first two FC layers, and the value estimate is obtained from the last FC layer.
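The actor and critic MLPs with orthogonal initialization and the Adam settings listed above can be sketched as follows; the output-layer gain and the gradient-clipping norm are assumptions:

```python
import torch
import torch.nn as nn

def ortho_mlp(sizes, out_gain=0.01):
    """MLP with Tanh hidden activations and orthogonal weight initialization,
    as used for both the actor (18-64-64-7) and the critic (54-64-64-1)."""
    layers = []
    for i in range(len(sizes) - 1):
        lin = nn.Linear(sizes[i], sizes[i + 1])
        gain = out_gain if i == len(sizes) - 2 else nn.init.calculate_gain("tanh")
        nn.init.orthogonal_(lin.weight, gain=gain)
        nn.init.zeros_(lin.bias)
        layers.append(lin)
        if i < len(sizes) - 2:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)

actor = ortho_mlp([18, 64, 64, 7])    # softmax applied to the 7 logits at use time
critic = ortho_mlp([54, 64, 64, 1])   # state-value output

optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()),
    lr=5e-4, eps=1e-5,
)
# Gradient clipping during updates, matching the tricks described above
# (the max_norm value is an assumption):
# torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=10.0)
```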
5.3. Path-Planning Results
We configure Monte Carlo simulation environments for training the planner network, in which the aggregation targets for the USVs and the obstacles are randomly located and the linear TS routes are also randomly generated. The planning objective of the USVs is to reach the aggregation targets while avoiding the TSs and obstacles.
We trained our model for 1 × 10⁶ iterations. Figure 10 shows the results of the Monte Carlo experiment for USV cooperation. The black-filled circles represent obstacles, the colored circles represent the aggregation positions of the USVs, the stars indicate the estimated aggregation targets, the triangles indicate the start points of the USVs, and the colored lines depict the obstacle-avoidance paths. Different colors are used to distinguish the elements belonging to different USVs. Since the planner needs to be retrained from the current model whenever a new environment differs significantly from the known one, we train the planner for 1 × 10⁶ iterations for efficiency.
We have observed that our planner can efficiently generate obstacle-avoidance paths in less than 1 s, and these paths can effectively avoid collisions while reaching the aggregation targets to maintain the formation. The success rate of achieving multiple objectives is higher than 95%. These observations testify to the efficiency and effectiveness of our method.
We train our planner for totals of 1 × 10⁶ and 6 × 10⁶ iterations. We evaluate the network and record the metric values every 5000 training iterations, and the evaluation reward curves are shown in Figure 11. The action-space dimension is 7, and the observation-space dimension is 24.
Figure 12 and Figure 13 depict the training-reward curves for the MADDPG and multi-agent TD3 (MATD3) algorithms. MATD3 improves upon the original DDPG algorithm by incorporating additional techniques, such as twin critic networks and delayed policy updates, to improve coordination and learning in complex environments. Due to the complexity of the training process and the long duration required for each iteration, we trained both the MADDPG and MATD3 algorithms for 1 × 10⁶ iterations.
We observed that the reward curve of our method converges as the training iterations approach 1 × 10⁶. The oscillation is relatively low, indicating that the method has relatively high stability.
Figure 11b shows the training-reward curve for 6 × 10⁶ training iterations. The rewards decrease as the training iterations increase, indicating that the method scales well with a large number of training iterations.
The curve becomes steady as the iteration count approaches 5 × 10⁶, which probably indicates that the model is sufficiently trained. Since training efficiency is also an important metric for an online planner, we benchmarked our method against the state-of-the-art MADDPG and MATD3 models after training them for 1 × 10⁶ iterations. The oscillations in the reward curves of MADDPG and MATD3 are significant, and the decreases in their rewards are not obvious, which implies that MADDPG and MATD3 may require more training iterations than our method. Meanwhile, training our MAPPO-based method for 6 × 10⁶ iterations took less than 5 h, whereas training MADDPG and MATD3 for only 1 × 10⁶ iterations took more than 10 h. These observations demonstrate that our method has a much higher training efficiency than MADDPG and MATD3, owing to the low complexity of MAPPO.
The results of the Monte Carlo simulations show that our MAPPO-based algorithm outperforms both the MADDPG and MATD3 algorithms in terms of the average path length and the steering angle. Specifically, the average path length of our algorithm is 0.489, which is lower than that of the MADDPG and MATD3 algorithms (0.683 and 0.582, respectively), indicating that our algorithm plans more optimal paths. Moreover, the average steering angle of our algorithm is 2.14 rad, which is lower than that of the MADDPG and MATD3 algorithms (2.35 and 2.23 rad, respectively), suggesting that our algorithm is also more efficient in steering the vehicles. Overall, these observations provide evidence that our method plans more optimal paths than the MADDPG and MATD3 models after being trained for 1 × 10⁶ iterations in the Monte Carlo simulation environment, highlighting the potential of our algorithm to improve the performance of autonomous USV navigation systems.
5.4. COLREGs-Based Collision-Avoidance Experiments
Figure 14 depicts the simulation results of USVs avoiding TSs, with randomly generated positions for both the USVs and the obstacles. The TS paths are also generated randomly. The obstacles are represented by black-filled triangles, rectangles, and circles. The USVs plan their paths to reach the aggregation targets, considering both low-cost paths and collision avoidance among the USVs and TSs. The red curves represent the paths of the three USVs, while the blue curves depict the paths of the TSs. The start points of the TS paths are indicated by blue triangles.
Figure 14 illustrates two representative scenarios of avoiding collisions in accordance with the COLREGs. USVs typically prioritize the most significant collision risks in the nearby area and subsequent collision-avoidance issues, taking actions according to the COLREGs to avoid a collision. The upper sub-figure illustrates that the first USV not only overtakes the second USV by turning to the starboard side in accordance with the COLREGs but also gives way to the TS crossing from the right by turning to the starboard side. Similarly, the second USV follows the COLREGs by turning to its starboard side when heading toward a TS.
The lower sub-figure illustrates that the first USV overtakes the third USV by maneuvering to its starboard side. When a TS crosses from the port side of the third USV, the USV stands on its course if it can pass the TS without risking a collision, in accordance with the COLREGs. In the subsequent segments of the voyage, the third USV turns to its port side to overtake the second USV. Simultaneously, the second USV adjusts its course to the starboard side to give way to the TS.
Figure 14 demonstrates that our path-planning method is capable of effectively planning multiple USV paths to achieve their targets while ensuring collision avoidance among USVs and TSs in accordance with the COLREGs regulations.
5.5. TS Detection Results
In the TS detection experiments, we detect the bow, stern, and entire body (i.e., "all") of a TS. The corresponding PR curves are shown in Figure 15. When any part of a TS is detected, the motion intention of the TS can be captured from the detection results in consecutive frames using the inter-frame difference scheme. The blue curve indicates the mean average precision (mAP) curve at an intersection-over-union (IoU) threshold of 0.5. The metric values imply that the model achieves a balance between recall and precision and demonstrates a relatively high overall performance. In the evaluation of the TS detection network on the SeaShips dataset, 10-fold cross-validation was employed: the dataset was divided into 10 subsets (folds) of approximately equal size, and for each fold, 70% of the images were used as the training set while the remaining 30% were used as the test set. The training and test sets did not share any images.
Figure 16 presents the TS detection result curves from different perspectives. The training loss and validation (val) loss curves demonstrate that the model converges quickly. The precision, recall, and mAP curves imply high performance of the model.
Figure 17 shows the result of the model detecting two TSs; the model performed well in our TS detection task. Its effectiveness may benefit from the following factors: since we use time-series images of the TSs, detecting any part of a TS in consecutive frames is enough to recognize the motion of the TS, and the detections are conducted several times per second.
6. Conclusions
This research proposes a two-stage path-planning method for multiple USVs based on the COLREGs. The method combines a cooperation module, a COLREGs-compliant action-evaluation module, and a TS detection module. The cooperative path-planning model for collision avoidance among USVs is constructed based on the MAPPO strategy, which utilizes a policy network capable of handling multiple aggregation goals, obstacles, and TSs. To achieve this, we define the action space, observation space, and reward function for the policy network, and design actor and critic networks.
Monte Carlo experimental results confirm the effectiveness and efficiency of our path-planning method for formation aggregation and collision avoidance. We conducted these experiments by randomly specifying the positions of the USVs and obstacles. This approach allowed us to evaluate the performance of our method in diverse scenarios and validate its robustness.
We benchmarked the simulation results against the MADDPG and MATD3 methods to validate the efficiency and the optimization performance of our approach.
After training the COLREGs-compliant action-evaluation module using TS-avoidance trajectories, USV actions that violate the COLREGs can be recognized and suppressed. This judgment is then used as a heuristic for the actor network. Our reward function considers both the COLREGs and seamanship principles.
The TS detection network is constructed based on the YOLO network and the squeeze-and-excitation scheme.
Primarily, we evaluate the algorithm through simulations conducted in gym environments. Additionally, we have conducted experiments on a semi-physical simulation platform where our algorithm acts as a local path planner, guiding the navigation of the system. The cooperative path-planning module is executed on individual ship-borne computers (PC-104) installed on each USV member. VxWorks 6.6 is utilized as the operating system, with Workbench 3.0 as the chosen development package.
In our future experiments, we aim to further advance our research by translating the algorithm into physical USVs. This will allow us to conduct practical implementation and comprehensive testing of the algorithm’s capabilities.