1. Introduction
In the era of Industry 4.0, one key aspect of achieving smart factories is reducing reliance on human labor through collaborative automation using robot manipulators [
1]. The move toward automation is evident in a wide range of tasks, including factory assembly lines, space exploration, customer service, and search-and-rescue missions. Consequently, enhancing the operational efficiency of multi-arm manipulators has become of utmost importance.
In the manufacturing industry, human experts manually search for collision-free paths for robotic manipulators to perform specific tasks. Robot path planning research commonly uses sampling-based algorithms which sample nodes from collision-free spaces to compute paths between the initial and target points [
2]. Traditional graph-search algorithms such as Dijkstra’s and A* find the shortest path by expanding nodes and updating distance estimates; however, they scale poorly in large search spaces, and Dijkstra’s algorithm cannot handle negative-weight edges. The rapidly-exploring random tree (RRT) algorithm is a probabilistic sampling-based approach for motion planning, but it provides neither optimality nor smoothness guarantees. The probabilistic roadmap (PRM) is another probabilistic sampling-based method that constructs graphs to represent feasible paths, but it can be inefficient in high-dimensional spaces. The artificial potential field (APF) method guides motion along attractive and repulsive potentials but can become stuck in local minima [
3,
4,
5,
6]. However, these methods are largely limited to single-arm robots; applying them to multi-arm manipulators is far more challenging due to the increased dimensionality [
7]. Additionally, multi-arm manipulators require efficient and optimal path learning, even with arbitrary starting and target positions. Recent advancements in model-free reinforcement learning have successfully addressed single-arm robot path planning [
8]. Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to take actions through interaction with its environment so as to maximize the cumulative reward. Multi-agent reinforcement learning (MARL) extends RL to scenarios in which multiple intelligent agents coexist and interact simultaneously. Each agent retains its capacity for autonomous decision making, but its decisions affect both its peers and the shared environment. MARL therefore studies the interactions and cooperative behaviors that arise among multiple intelligent agents. Applying reinforcement learning to multiple agents is of paramount importance when addressing the intricate challenge of path planning for dual-arm robots [
9].
In the realm of robotic arm control, deep reinforcement learning techniques have demonstrated the ability to generate smoother trajectories compared to conventional methods. These approaches also excel in planning robotic arm movements in dynamic environments [
10]. Model-free reinforcement learning has made significant progress in robot planning and control, allowing autonomous agents to acquire skills through interaction with their environment, thereby simplifying the complexities associated with the modeling and manual tuning of controller parameters.
Reinforcement learning demonstrates its effectiveness prominently in projects like Dex-Net, which leverages comprehensive object datasets and nuanced object attribute modeling to develop effective gripping strategies for robotic manipulators. This process involves optimization through the application of reinforcement learning methodologies [
11,
12,
13]. The manipulator’s grasp-related actions are carried out within a simulated environment, where the quality of the grasp is evaluated using predefined reward functions.
DQN (deep Q-network) and Rainbow are notable examples in the realm of value-based reinforcement learning methods. DQN combines deep neural networks with Q-learning techniques tailored to address the challenges posed by high-dimensional state spaces. Rainbow is an extension and refinement of the foundational DQN framework [
14]. It incorporates various reinforcement learning enhancements to tackle issues encountered by DQN in complex tasks, such as overestimation and low sample efficiency [
15]. In contrast, SAC (soft actor-critic) and DDPG (deep deterministic policy gradient) are two policy-based reinforcement learning approaches. DDPG is designed for reinforcement learning problems with continuous action spaces. SAC adopts a maximum-entropy objective that balances exploration and exploitation by maximizing the entropy of the policy, promoting more effective exploration of the environment. Google’s robot learning in homes initiative harnesses reinforcement learning and transfer learning paradigms to enable robotic systems to assimilate knowledge from simulated environments and subsequently apply it to real domestic settings [
16]. By executing tasks and learning through interactions within simulated environments, the robot then adapts its acquired proficiency to diverse real-world home scenarios via transfer learning methodologies [
17].
Classic deep reinforcement learning algorithms such as proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and soft actor-critic (SAC) have been proven to solve planning problems for single-arm robots [
18,
19,
20].
However, single-arm robots have limitations in operation and control. Dual-arm robots with coordinated operations and excellent human–robot collaboration capabilities demonstrate clear advantages. As an illustration, the MADDPG algorithm exemplifies a class of multi-agent reinforcement learning methodologies adept at managing cooperative and competitive interactions between dual robotic arms [
21]. This capability allows these intelligent agents to collaborate harmoniously, working in concert to attain shared objectives. Furthermore, it is noteworthy that the MADDPG algorithm exhibits scalability, extending its applicability to systems encompassing a substantial quantity of agents, rendering it particularly well suited for addressing challenges within intricate multi-agent environments.
Since path planning for dual-arm manipulators is a high-dimensional problem, existing reinforcement learning-based path planning algorithms may suffer from poor exploration performance, resulting in suboptimal generated paths [
22]. Providing arbitrary starting and target positions for reinforcement learning-based path planning can pose challenges due to physical constraints or the high dimensionality of the configuration space, leading to sparse rewards as the agent struggles to find the shortest path [
23].
In order to improve the real-time planning efficiency of dual-arm robots in dynamic environments and enhance exploration efficiency in high-dimensional spaces, this paper proposes a dual-arm robot path planning algorithm based on the deep deterministic policy gradient (DDPG) algorithm. The contributions of this paper can be summarized as follows:
- (i)
In order to address the path planning challenges inherent in the dual-arm robot system, we have devised a dedicated configuration space tailored to accommodate the specific setup of dual arms. Furthermore, we have formulated a path cost function with the aim of facilitating efficient path planning under these circumstances.
- (ii)
To tackle the issue of reward sparsity in multi-agent reinforcement learning, a series of strategic measures have been implemented, including the introduction of a replay buffer and the expansion of experience storage. These endeavors are intended to harness existing experiential data to generate more valuable training samples.
- (iii)
To minimize the cost associated with path planning, we have integrated A* shortest-path constraints into the loss function. Additionally, our algorithm has undergone validation and testing within the pybullet simulation environment, the results of which confirm its efficacy in addressing the reward sparsity challenge in reinforcement learning and enhancing path planning for dual-arm robots.
2. Related Work
This section provides an overview of research on the use of multi-agent reinforcement learning for the control of robotic arms. When employing multi-agent reinforcement learning methods to control a robotic arm, it is essential to pre-configure the workspace. In other words, the design of the robotic arm’s configuration space is a prerequisite for conducting deep reinforcement learning.
Concurrently, multi-agent reinforcement learning methods are confronted with the challenge of reward sparsity, which can give rise to specific issues such as learning stagnation and excessive exploration. Therefore, mitigating the problem of reward sparsity holds significant academic and practical significance.
2.1. Multi-Agent Reinforcement Learning
Reinforcement learning is a decision-making process based on Markov decision processes (MDP), which is a mathematical framework for describing stochastic processes in decision problems. It extends Markov chains and decision theory and is widely applied in reinforcement learning and optimization problems. Reinforcement learning methods can be categorized into value-based methods and policy-based methods [
24]. Value-based methods use techniques like deep Q-networks (DQN) to approximate the optimal value function and derive the corresponding optimal policy from the approximated value function [
25]. Policy-based methods directly compute the optimal policy based on the agent’s experience.
In multi-agent tasks, policy-based methods (also known as policy gradients) have shown better performance than value-based methods in continuous action tasks [
26]. Therefore, in the context of robot path planning, multi-agent reinforcement learning methods primarily employ policy-based methods to solve tasks.
MADDPG and MATD3 are two multi-agent reinforcement learning methods. MATD3 (multi-agent twin delayed deep deterministic policy gradient) is a multi-agent reinforcement learning algorithm proposed in 2018. It is an extension of the TD3 algorithm to multi-agent environments, aiming to address cooperation or competition issues among multiple agents [
27].
The MADDPG (multi-agent deep deterministic policy gradient) algorithm represents a significant advancement in addressing challenges inherent to multi-agent reinforcement learning problems. In the MADDPG algorithm, each agent has its own actor network and critic network, which are based on the DDPG algorithm. The actor network is responsible for generating actions based on the agent’s state, while the critic network evaluates the value of actions. Notably diverging from conventional single-agent DDPG algorithms, the critic network within MADDPG encompasses an additional dimension: it evaluates the worth of actions not only in light of an agent’s personal state and actions but also in consideration of the states and actions undertaken by fellow agents. During the training process, both the actor and critic networks of each agent need to be optimized. The target value for the critic network depends not only on the agent’s own state and actions but also on the states and actions of other agents. By virtue of the MADDPG algorithm, a mechanism is established through which multiple agents can actively engage and acquire knowledge within their environment. This engagement facilitates the refinement of their distinct strategic approaches while concurrently fostering the exchange of pertinent information. This dual process culminates in the attainment of collaborative decision-making dynamics and the emergence of cooperative behaviors among the agents.
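To make the centralized-critic structure concrete, the following minimal PyTorch sketch shows one possible per-agent actor/critic pair of the kind described above; the layer sizes and the class names (`Actor`, `CentralizedCritic`) are illustrative assumptions rather than the implementation of any specific MADDPG codebase.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an agent's own observation to a continuous action (DDPG-style)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Scores a joint (all observations, all actions) pair with a single Q-value."""
    def __init__(self, total_obs_dim, total_act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_actions):
        # The critic sees the concatenated global observations and actions.
        return self.net(torch.cat([all_obs, all_actions], dim=-1))
```

In a two-agent setup, each agent would own one `Actor` and one `CentralizedCritic`; the critic’s input dimension grows with the number of agents, which is exactly where the centralized training signal comes from.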
In the process of utilizing deep reinforcement learning for dual-arm robot path planning, the design of the configuration space holds a pivotal role. The configuration space serves to delineate feasible poses of the robot and facilitate trajectory planning. A significant challenge encountered in deep reinforcement learning is the sparsity of rewards during the learning process. Below, we review work on configuration space design and initiatives directed towards the amelioration of reward sparsity.
2.2. Configuration Space
The establishment of a robot’s configuration space represents a pivotal component within the realm of robot motion planning, underscoring its profound significance within this domain. This workspace configuration is represented as $Q$, which refers to all possible combinations of the robotic arm joint angles. Typically, we use a joint angle vector $q$ to represent a configuration in $Q$. Joint space path planning is the process of generating a curve of joint variable changes under given constraints on the joint angles (such as the start point, target point, or positions, velocities, and accelerations of intermediate nodes). In joint space, each vector represents a position and orientation of the robotic arm’s motion. Specifically, assuming the robotic arm uses $n$ joint mechanisms, the workspace configuration can be defined as a subset of an $n$-dimensional vector space, where each vector represents a joint angle vector, i.e., $Q = \{\, q = (q_1, q_2, \ldots, q_n) \mid q_i^{\min} \le q_i \le q_i^{\max},\ i = 1, \ldots, n \,\}$, where $q_i$ represents the angle of the $i$-th joint, and $q_i^{\min}$ and $q_i^{\max}$ represent the minimum and maximum allowable angles for that joint.
By defining and describing the configuration of the workspace, we can determine the feasible range of motion for the robotic arm’s joint mechanisms, enabling path planning and control. The grid-based method divides the configuration space into grid cells, where each cell represents a feasible configuration region. The sampling-based path planning method divides the configuration space into smaller regions called cells. The local segmentation method divides the configuration space into local subspaces or regions. Prior knowledge or rules can also be utilized to divide the configuration space. For example, based on prior knowledge such as the geometric shape of the robotic arm, joint limitations, and task requirements, the configuration space can be divided into reasonable regions.
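As a simple illustration of such a joint-limit configuration space and its free-space subset, the sketch below samples random configurations and checks them against per-joint bounds and a collision test; the joint limits and the collision checker are placeholder assumptions (in practice the check would come from the simulator, e.g., pybullet).

```python
import numpy as np

# Hypothetical joint limits (radians) for a 6-DOF arm.
q_min = np.array([-3.14, -2.0, -2.5, -3.14, -2.0, -3.14])
q_max = np.array([ 3.14,  2.0,  2.5,  3.14,  2.0,  3.14])

def sample_configuration(rng=np.random.default_rng()):
    """Draw a random joint-angle vector q inside the box [q_min, q_max]."""
    return rng.uniform(q_min, q_max)

def in_free_space(q, collision_fn=lambda q: False):
    """A configuration belongs to the free space if it respects the joint
    limits and the (placeholder) collision check reports no collision."""
    within_limits = np.all((q >= q_min) & (q <= q_max))
    return bool(within_limits and not collision_fn(q))

q = sample_configuration()
print(q, in_free_space(q))
```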
Tom et al. introduced a novel approach to enhance robotic arm path planning by integrating the firefly algorithm with Q-learning. Through a Q-learning policy, the optimal parameter values of the firefly algorithm are learned, leading to improved efficiency. This technique has been successfully applied to robotic arm path planning, yielding promising outcomes. In a separate study [
28], a path planning strategy for continuum arms was proposed, emphasizing the robot’s workspace rather than its configuration space. Furthermore, another study [
29] presented a path planning algorithm employing the SAC (soft actor-critic) algorithm, facilitating rapid and effective path planning for multi-arm manipulators.
Traditional path planning methods struggle to perform well in complex high-dimensional environments. Additionally, treating multiple robotic arms as a single arm in reinforcement learning methods may not effectively control the coordinated motion of multiple arms. Therefore, using multi-agent reinforcement learning algorithms for multi-arm path planning is a worthwhile area of research.
2.3. The Problem of Reward Sparsity
In practical applications of deep reinforcement learning, the problem of reward sparsity has always been a core challenge [
30]. The agent struggles to obtain sufficient reward signals, leading to difficulties in learning and high time costs. The reward sparsity problem refers to the situation where the reward signal appears only in a few states or actions, making it difficult for the agent to learn the correct behavioral policy [
31]. This can result in challenges in the algorithm’s convergence during iterations.
Savinov et al. [
32] presented an innovative curiosity-based methodology aimed at mitigating the challenge of sparse rewards. This approach leverages episodic memory to generate novelty bonuses, thereby enhancing the learning process. Another noteworthy technique, reward shaping, involves the deliberate manipulation of reward functions to facilitate the acquisition of accurate policies by the agent. Notably, a recent contribution to the discourse on reward shaping methods is the research by Jin et al. [
33]. Hierarchical reinforcement learning (HRL) has emerged as a robust strategy for addressing challenges posed by long-horizon problems characterized by sparse and delayed rewards. In a work by Li et al. [
34], a comprehensive framework for HRL is introduced, integrating auxiliary rewards founded on intrinsic functions. This HRL framework stands as an efficient conduit for the harmonious acquisition of high-level policies and low-level skills, obviating the necessity for domain-specific expertise. In parallel, the HER (hindsight experience replay) algorithm [
35] proffers an alternate avenue for tackling the issue of sparse rewards. The core concept underlying HER is to recontextualize exploration failures as successful outcomes, thus optimizing the utility of exploration data in scenarios where rewards are scarce.
The experience pool stores transitions of the form $(s_t, a_t, r_t, s_{t+1})$, and we expand it to $(s_t, a_t, r_t, s_{t+1}, g)$, where the corresponding goal information $g$ is additionally stored. Furthermore, the action policy is also conditioned on the goal, denoted as $\pi(a_t \mid s_t, g)$. Then, a complete agent experience sequence is sampled from the experience pool based on the actual goal $g$. Finally, the rewards are recalculated using a new goal $g'$ drawn from the states actually achieved in the episode; that is, the stored reward is replaced by the reward the agent would have received had $g'$ been the intended goal.
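A minimal sketch of this goal-relabelling idea is shown below, assuming a sparse goal-reaching reward and the common "future" goal-selection strategy; the field names and the tolerance are illustrative, not the exact formulation used in the original HER paper or in this work.

```python
import numpy as np

def her_relabel(episode, k_goals=4, tol=0.05, rng=np.random.default_rng()):
    """episode: list of dicts with keys s, a, r, s_next, achieved, g.
    Returns additional transitions whose goal is replaced by a later achieved state."""
    relabeled = []
    for t, tr in enumerate(episode):
        # Pick k achieved goals from later steps of the same episode ("future" strategy).
        future_ids = rng.integers(t, len(episode), size=k_goals)
        for idx in future_ids:
            g_new = episode[idx]["achieved"]
            # Recompute a sparse reward with respect to the substituted goal g'.
            r_new = 0.0 if np.linalg.norm(tr["achieved"] - g_new) < tol else -1.0
            relabeled.append({**tr, "g": g_new, "r": r_new})
    return relabeled
```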
3. Proposed Method
In this section, to design a path planning algorithm for a dual-arm robot based on MADDPG, we modeled the reinforcement learning problem as an MDP (Markov decision process) and used the HER (hindsight experience replay) algorithm to improve reward sparsity in MADDPG, thereby enhancing training efficiency and algorithm stability. Additionally, we expanded the configuration space of the dual-arm robot. In the given collaborative task, the dual-arm robot needs to learn how to cooperate and accomplish the task through a series of actions. This involves two agents controlling the robot in a non-deterministic environment with a continuous action space. During the training process, the strategies of each agent will evolve. The intelligent agents need to adapt to the environmental states and coordinate their action policies with other agents.
Figure 1 is the overall framework diagram of our algorithm. Upon the completion of each transition, it is customary to enqueue said transition into a designated queue referred to as the “Replay Buffer.” The capacity of this replay buffer is determined by a hyperparameter denoted as “n”, allowing it to store a maximum of n distinct transitions. In the event that the replay buffer reaches its full capacity, any newly incoming transition will replace the oldest existing transition within the buffer. The update process for both the actor and critic is carried out by uniformly sampling a minibatch of data from this buffer.
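The replay buffer described here can be sketched in a few lines; the capacity and the transition layout are assumptions for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Holds at most n transitions; when full, the oldest one is overwritten."""
    def __init__(self, n=100_000):
        self.buffer = deque(maxlen=n)

    def push(self, transition):
        # transition: e.g. (obs_all, act_all, rew_all, next_obs_all, goal)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling of a minibatch for the actor/critic updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```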
In MADDPG, each intelligent agent, during the forward pass of computing its own critic, concatenates the observations of all agents, including itself, into an observation vector $x = (o_1, \ldots, o_N)$, and concatenates all agents’ actions into an action vector $a = (a_1, \ldots, a_N)$. This pair serves as the input to the online critic network, which outputs a one-dimensional Q-value, denoted as $Q_i^{\mu}(x, a_1, \ldots, a_N)$. In other words, each agent employs the global information from all agents in the environment (global observations and actions) to “centralize” the training of its own critic network. Having obtained the Q-values, and being able to compute the target value from replay samples, the subsequent steps mirror those of DDPG.
During the forward pass of computing its individual actor network, each intelligent agent only utilizes its own local observation vector $o_i$ as the input to the online actor network, which outputs a deterministic action for agent $i$, denoted as $a_i = \mu_i(o_i)$. Subsequently, following the same procedure as DDPG, the mean squared error (MSE) loss for the temporal difference error is computed for the critic, the policy gradient with respect to the actor parameters $\theta_i$ is calculated, and these gradients are used for parameter updates by gradient descent.
$\mathcal{D}$ is an experience replay buffer, composed of elements $(x, a_1, \ldots, a_N, r_1, \ldots, r_N, x')$. The update method for the centralized critic is inspired by the temporal difference and target network ideas of DQN: $\mu'$ denotes the set of target policies, and $\theta'_j$ are the parameters of the target policy of agent $j$, which are updated with a delay. The strategies of other agents can be obtained by fitting an approximation, without any communication interaction.
As indicated above, the critic network utilizes global information for learning, while the actor network only relies on local observational information. One key insight from MADDPG is that if we have knowledge of all agents’ actions, the environment becomes stationary, even when policies are continuously updated. This is because the dynamics of the environment remain stable and unaffected by policy changes:
$$P(s' \mid s, a_1, \ldots, a_N, \pi_1, \ldots, \pi_N) = P(s' \mid s, a_1, \ldots, a_N) = P(s' \mid s, a_1, \ldots, a_N, \pi'_1, \ldots, \pi'_N) \quad \text{for any } \pi_i \neq \pi'_i.$$
The approximation cost is a logarithmic (negative log-likelihood) cost function; together with an entropy term for the approximated policy, the cost function can be written as:
$$\mathcal{L}(\phi_i^j) = -\mathbb{E}_{o_j, a_j}\left[\log \hat{\mu}_i^j(a_j \mid o_j) + \lambda H\!\left(\hat{\mu}_i^j\right)\right],$$
where $\hat{\mu}_i^j$ is agent $i$’s approximation of agent $j$’s policy with parameters $\phi_i^j$, and $H$ denotes the policy entropy.
As long as the above cost function is minimized, the other agents’ strategies can be approximated, and the approximated policies can replace the true ones when computing $y$ in Equation (8).
Before updating the centralized critic, the approximated policies $\hat{\mu}_i^j$ are updated using a sample batch from the experience replay buffer.
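The sketch below shows one way to fit such an approximation of another agent’s policy by maximizing its log-likelihood plus an entropy bonus; it assumes a Gaussian approximator with illustrative layer sizes and is not the exact estimator used in MADDPG implementations.

```python
import torch
import torch.nn as nn

class GaussianPolicyApprox(nn.Module):
    """Agent i's approximation of agent j's policy: a Gaussian over a_j given o_j."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs_j):
        h = self.body(obs_j)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def approximation_loss(policy_approx, obs_j, act_j, entropy_coef=1e-3):
    """Negative log-likelihood of agent j's observed actions, with an entropy bonus."""
    dist = policy_approx(obs_j)
    log_prob = dist.log_prob(act_j).sum(-1)
    entropy = dist.entropy().sum(-1)
    return -(log_prob + entropy_coef * entropy).mean()
```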
In a multi-agent environment, each intelligent agent, at each time step $t$, selects an action from its action space based on its individual policy. The combination of these individual actions constitutes the joint action, while the aggregation of individual policies forms the joint policy. For a task involving $N$ intelligent agents, the MADDPG algorithm consists of $N$ policy functions and $N$ evaluation (critic) functions. The gradient for the $i$-th agent is:
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\right].$$
MADDPG introduces the concept of a policy ensemble. The policy for the $i$-th agent is represented by a collection of $K$ sub-policies, and in each training episode only one sub-policy $\mu_{\theta_i^{(k)}}$ (abbreviated as $\mu_i^{(k)}$) is utilized. For each agent, we maximize the expected return of its policy ensemble:
$$J_e(\mu_i) = \mathbb{E}_{k \sim \mathrm{unif}(1, K),\, s \sim p^{\mu},\, a \sim \mu_i^{(k)}}\left[R_i(s, a)\right].$$
3.1. Motion Planning and Configuration Space for Dual-Arm Robots
Preceding the application of reinforcement learning, a foundational step involves the formulation of a problem model. MDP (Markov decision process) provides a formal method for describing problems with decision making and stochasticity. By modeling the problem as an MDP, the reinforcement learning problem can be transformed into a mathematical framework, making the problem more structured and solvable. MDP clearly describes the key elements of a decision-making process by defining states, actions, transition probabilities, and reward functions, among other components. This allows the essence and critical factors of the problem to be accurately captured and represented.
In preparation for the formalization of the problem into an MDP, a foundational prerequisite entails the definition of the configuration space intrinsic to the dual-arm robot. The dual-arm robot consists of two identical robotic arms with $n$ degrees of freedom. Let $Q_R$ represent the configuration space of the right arm, let $Q_L$ represent the configuration space of the left arm, and denote the joint configuration space of the dual arms as $Q = Q_R \times Q_L$. During the motion of the robotic arms, the part of the configuration space in which collisions with obstacles occur is referred to as the obstacle space, denoted as $Q_{obs}$. The configuration space without collisions is referred to as the free space, denoted as $Q_{free}$. Thus, the configuration space $Q$ can be represented as the union of $Q_{obs}$ and $Q_{free}$, i.e., $Q = Q_{obs} \cup Q_{free}$.
Assuming the initial configuration of the robotic arms is $q_{init}$ and the goal configuration is $q_{goal}$, the motion planning of the robotic arms can be represented as a continuous map $\sigma: [0, 1] \to Q_{free}$ with $\sigma(0) = q_{init}$ and $\sigma(1) = q_{goal}$, where $\sigma$ represents a collision-free motion path from the initial configuration to the goal configuration. For the planned motion path $\sigma$, the path cost is calculated using a composite function of multiple variables. In this paper, the path cost function consists of a distance cost and a smoothness cost. The distance cost is the Euclidean distance between each node on the path and the target position, which helps find the shortest path; the Euclidean distance is the straight-line distance between two points in Euclidean space. Define the distance cost function as $C_{dist}(\sigma)$. By calculating the Euclidean distance between the current position of the robotic arm and the target position, we obtain the distance cost.
The smoothness cost is calculated according to the curvature of the path and helps to reduce drastic changes in the motion of the manipulator. The curvature is measured by the angle formed by three adjacent points on the path. Define the smoothness cost function as $C_{smooth}(\sigma)$.
Therefore, the path cost function can be defined as follows:
$$C(\sigma) = \alpha\, C_{dist}(\sigma) + \beta\, C_{smooth}(\sigma),$$
where $C(\sigma)$ represents the total cost of the path, $C_{dist}(\sigma)$ is the distance cost, $C_{smooth}(\sigma)$ is the smoothness cost, and $\alpha$ and $\beta$ are coefficients that balance the two.
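A small numerical sketch of this composite cost, treating a path as a sequence of waypoints, is given below; the weights `alpha` and `beta` are placeholder values.

```python
import numpy as np

def distance_cost(path, target):
    """Sum of Euclidean distances from each waypoint to the target position."""
    path = np.asarray(path, dtype=float)
    return float(np.sum(np.linalg.norm(path - np.asarray(target, dtype=float), axis=1)))

def smoothness_cost(path):
    """Sum of turning angles formed by every triple of adjacent waypoints."""
    path = np.asarray(path, dtype=float)
    cost = 0.0
    for p0, p1, p2 in zip(path[:-2], path[1:-1], path[2:]):
        v1, v2 = p1 - p0, p2 - p1
        denom = np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9
        cos_angle = np.clip(np.dot(v1, v2) / denom, -1.0, 1.0)
        cost += np.arccos(cos_angle)   # sharper turns -> less smooth
    return float(cost)

def path_cost(path, target, alpha=1.0, beta=0.5):
    return alpha * distance_cost(path, target) + beta * smoothness_cost(path)
```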
3.2. MDP Modeling for Dual-Arm Robot Path Planning
The MDP sequence unit is composed of a tuple $\langle \mathcal{S}, \mathcal{A}_1, \ldots, \mathcal{A}_N, P, R_1, \ldots, R_N, \gamma \rangle$. Unlike the MDP in a single-agent scenario, here the transition function $P(s' \mid s, \boldsymbol{a})$ and the reward functions $R_i(s, \boldsymbol{a})$ are defined over the joint action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$, where $\boldsymbol{a} = (a_1, \ldots, a_N)$ and $a_i \in \mathcal{A}_i$. For agent $i$, all other agents except itself are denoted as $-i$. In this case, the value function depends on the joint action $\boldsymbol{a}$ and the joint policy $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)$.
Let $\pi_i$ represent the stochastic policy of agent $i$; the value function $V_i^{\boldsymbol{\pi}}(s)$ is obtained by taking the expected cumulative reward:
$$V_i^{\boldsymbol{\pi}}(s) = \mathbb{E}_{\boldsymbol{a}_t \sim \boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} R_i(s_t, \boldsymbol{a}_t) \,\middle|\, s_0 = s\right].$$
The MDP model consists of three main elements: the state space, the action space, and the reward function. In the context of the robotic arm path planning problem in this paper, the speed of the arm’s movement is not considered. The MDP diagram used in this paper is illustrated in Figure 1. The paper proposes combining the target position in Cartesian space with the gripper’s opening and orientation to define the action space. The agent can simultaneously control the arm’s position and the gripper’s state to perform tasks such as grasping and placing objects. The state space includes the current joint angles, the end-effector position and orientation of the robotic arm, information about obstacles and the environment, as well as the target task position and requirements. The reward function plays a crucial role in deep reinforcement learning algorithms, as it determines the convergence speed and quality of the algorithm. In this paper, the reward function is defined based on the task and environment settings. When the end-effector of the robotic arm is within a fixed range $\omega$ of the target, the agent receives a positive reward to incentivize reaching the target point. If the next action of the robotic arm reduces the distance cost $C_{dist}$ or the smoothness cost $C_{smooth}$, the agent also receives a positive reward. On the other hand, if the robotic arm collides with the other arm or obstacles, the agent receives a negative reward to discourage collisions.
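A hedged sketch of a reward with this shape (goal bonus inside the tolerance $\omega$, progress bonuses for reducing the distance or smoothness cost, collision penalty) is shown below; the numeric constants are illustrative, not the tuned values used in the experiments.

```python
import numpy as np

def step_reward(ee_pos, target, prev_dist_cost, dist_cost,
                prev_smooth_cost, smooth_cost, collided, omega=0.05):
    r = 0.0
    if np.linalg.norm(np.asarray(ee_pos) - np.asarray(target)) < omega:
        r += 10.0      # end-effector reached the neighborhood of the target
    if dist_cost < prev_dist_cost:
        r += 1.0       # the action reduced the distance cost
    if smooth_cost < prev_smooth_cost:
        r += 0.5       # the action reduced the smoothness cost
    if collided:
        r -= 10.0      # collision with the other arm or an obstacle
    return r
```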
3.3. Two-Arm Path Planning Algorithm Based on Depth Deterministic Strategy Gradient
The PPDDPG algorithm utilizes two agents to control two robotic arms and is trained using the DDPG algorithm. To promote cooperation between the agents, reduce competition, and enhance collaboration in overlapping action spaces, the two agents share their observations and actions with each other. Each agent adjusts its action policy based on the output of the other agent, enabling cooperation to be achieved. The block diagram of a single agent is shown in
Figure 2.
Each robotic arm has its own stochastic policy $\pi_i(a_i \mid o_i)$ that only uses its own observations and actions, i.e., a probability distribution over actions given its own observations, or a deterministic policy $\mu_i(o_i)$. Each robotic arm learns a centralized action-value function $Q_i^{\mu}(x, a_1, a_2)$, where $a_1$ and $a_2$ are the actions of all the agents. Each $Q_i^{\mu}$ is learned independently, so each robot can have arbitrary forms of reward functions, including conflicting reward functions in competitive environments. At the same time, each agent’s actor network explores and updates its policy parameters independently. Each robotic arm takes an action $a_i$ in the current state, receives a reward $r_i$, and observes the next state $x'$. The tuple $(x, a_1, a_2, r_1, r_2, x')$ is then stored in an experience replay buffer $\mathcal{D}$. Once the size of the buffer exceeds a threshold, the networks start learning.
The online policy network comprises an architectural framework consisting of input, hidden, and output layers. The meticulous design of the input layer’s structure is of paramount importance, as it must be tailored to receive information that encapsulates the current state of the robotic arm. This state information may encompass various parameters, including but not limited to position, orientation (angles), and velocity, among others. The judicious selection of the appropriate number of hidden layers and their associated neurons is imperative. These hidden layers serve the crucial role of facilitating feature extraction and introducing essential non-linear transformations, thereby enhancing the network’s capacity to effectively represent the state and policy of the robotic arm. Additionally, the output layer assumes the responsibility of generating the actions to be executed by the robotic arm. It typically encompasses dimensions corresponding to the action space within which the robotic arm operates. In the context of a continuous action space, it is admissible for the output layer to incorporate multiple neurons, with each neuron dedicated to a specific action dimension.
The target policy network, akin to the online policy network in structural composition, diverges significantly in the aspect that its parameters remain invariable. In order to ensure the target policy network’s congruence with the online policy network, periodic parameter updates are necessary to sustain alignment between the two.
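A minimal sketch of such an online/target policy pair and the periodic soft update is given below; the layer sizes, state/action dimensions, and the update rate `tau` are assumptions.

```python
import copy
import torch
import torch.nn as nn

def make_policy(state_dim, action_dim, hidden=256):
    """Online policy network: state in, one output neuron per action dimension."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, action_dim), nn.Tanh(),
    )

online_policy = make_policy(state_dim=14, action_dim=7)
target_policy = copy.deepcopy(online_policy)     # same structure, parameters kept frozen
for p in target_policy.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def soft_update(target, online, tau=0.005):
    """Periodically pull the target parameters toward the online parameters."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p_o)
```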
The A* algorithm is an efficient direct search method for finding the shortest path for a robot in a static working environment. The algorithm’s search speed depends on how close the estimated distances are to the actual values. Building on the A* algorithm, we incorporate an A* shortest-path constraint into the loss function. The purpose of this loss function is to train a single robot to learn the shortest movement route provided by the A* path planning algorithm. The loss function, denoted as $\mathcal{L}_{A^*}$, can be expanded as follows:
$$\mathcal{L}_{A^*} = -\frac{1}{T}\sum_{t=1}^{T}\left[\, y_t \log p_t + (1 - y_t)\log(1 - p_t) \,\right],$$
where $y_t$ represents the ground-truth label of whether the action $a_t$ taken by the agent at time $t$ is the same as the A* algorithm’s action: if it matches the action chosen by the A* algorithm at time $t$, $y_t$ is 1 (correct), otherwise it is 0 (incorrect); $p_t$ represents the probability of outputting this action under the policy at time $t$, ranging from 0 to 1.
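Under the cross-entropy reading of this constraint given above (an assumption about the exact form), a minimal sketch of the loss is:

```python
import torch

def a_star_constraint_loss(p_t, y_t, eps=1e-8):
    """p_t: policy probabilities for the taken actions, shape (T,), values in (0, 1).
    y_t: 1.0 where the taken action matches the A* action, else 0.0."""
    p_t = torch.clamp(p_t, eps, 1.0 - eps)
    return -(y_t * torch.log(p_t) + (1.0 - y_t) * torch.log(1.0 - p_t)).mean()
```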
Algorithm 1 is the pseudocode for our proposed path planning algorithm for a dual-arm robot based on the depth deterministic strategy gradient algorithm. The details of the algorithm are as follows:
Algorithm 1: Path planning algorithm of the dual-arm robot based on the deep deterministic policy gradient algorithm
Initialize: for each agent $i$, the actor network $\mu_i$ with parameters $\theta_i$ and the target actor network $\mu'_i$ with parameters $\theta'_i$; the critic network $Q_i$ with parameters $w_i$ and the target critic network $Q'_i$ with parameters $w'_i$; and the experience replay buffer $\mathcal{D}$
1. For episode = 1 to Max-episodes do
2.   Initialize a random process $\mathcal{N}$ for action exploration
3.   Sample a goal $g$
4.   Get the initial state $s$
5.   For t = 1 to Max-step do
6.     For each agent $i$, select action $a_i = \mu_i(o_i) + \mathcal{N}_t$ from the action space according to the current policy and exploration noise
7.     Execute the joint action $a = (a_1, a_2)$ and observe the new state $s'$
8.   End for
9.   For t = 1 to Max-step do
10.    For each agent $i$, calculate the reward $r_i$
11.    Store $(s, a, r, s', g)$ in the replay buffer $\mathcal{D}$
12.    Sample a set of additional goals $G$ for replay
13.    For each $g' \in G$ do
14.      For each agent $i$, recompute the reward $r'_i = r(s, a, g')$
15.      Store $(s, a, r', s', g')$ in the replay buffer $\mathcal{D}$
16.    End for
17.  End for
18.  For t = 1 to Max-step do
19.    For agent $i$ = 1 to 2 do
20.      Sample a random minibatch of $S$ samples $(x^j, a^j, r^j, x'^j)$ from $\mathcal{D}$
21.      Set $y^j = r_i^j + \gamma\, Q'_i(x'^j, a'_1, a'_2)\big|_{a'_k = \mu'_k(o_k^j)}$ and update the parameters of the critic’s evaluation network by minimizing the loss $L(w_i) = \frac{1}{S}\sum_j \left(y^j - Q_i(x^j, a_1^j, a_2^j)\right)^2$
22.      Update the actor policy using the sampled policy gradient $\nabla_{\theta_i} J \approx \frac{1}{S}\sum_j \nabla_{\theta_i} \mu_i(o_i^j)\, \nabla_{a_i} Q_i(x^j, a_1^j, a_2^j)\big|_{a_i = \mu_i(o_i^j)}$
23.    End for
24.    Update the target network parameters for each agent $i$: $\theta'_i \leftarrow \tau\theta_i + (1-\tau)\theta'_i$, $w'_i \leftarrow \tau w_i + (1-\tau) w'_i$
25.  End for
26. End for
First, we initialize the actor and critic networks for each individual agent and initialize the experience replay buffer. The experience replay buffer is used to ensure that the samples are independently and uniformly distributed. At each step, both the actor and critic update their parameters by sampling mini-batches from the buffer. At the beginning of each episode, we initialize a random noise process and set an initial target goal. We obtain the initial state and start the iteration. Each agent executes an action based on its policy and the noise process. After each agent executes an action, we observe the total reward and obtain the new state. To improve the convergence of the algorithm, we employ the HER (hindsight experience replay) algorithm to process the samples. The new tuple is stored in the experience replay buffer, and the new state is used as the starting state for the next iteration.
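Putting these pieces together, a highly condensed training-loop skeleton consistent with Algorithm 1 could look as follows; the `env`, `agents`, and `buffer` interfaces are assumed (e.g., the sketches above), so this is an outline rather than the actual implementation.

```python
def train(env, agents, buffer, her_relabel, max_episodes=1000, max_steps=200,
          warmup=1000, batch_size=256):
    """Outer loop of Algorithm 1: collect an episode, relabel with HER, then update."""
    for _ in range(max_episodes):
        goal = env.sample_goal()
        obs = env.reset()
        trajectory = []
        for _ in range(max_steps):
            actions = [ag.act(o, goal, explore=True) for ag, o in zip(agents, obs)]
            next_obs, rewards, done, info = env.step(actions)
            trajectory.append({"s": obs, "a": actions, "r": rewards,
                               "s_next": next_obs, "achieved": info["achieved"], "g": goal})
            obs = next_obs
            if done:
                break
        for tr in trajectory + her_relabel(trajectory):  # original + relabeled transitions
            buffer.push(tr)
        if len(buffer) > warmup:
            for ag in agents:
                ag.update(buffer.sample(batch_size))     # critic, actor, and target updates
```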
5. Conclusions
This paper presents a novel path planning methodology, referred to as the path planning for dual-arm robots based on multi-agent deep deterministic policy gradient (PPDDPG) algorithm, aimed at addressing the path planning challenges specific to dual-arm robots. The proposed approach leverages advanced multi-agent deep reinforcement learning techniques to enhance path planning capabilities.
The core innovation of the PPDDPG algorithm lies in its tailored configuration space, designed to meet the unique path planning requirements of dual-arm robots. Additionally, our algorithm incorporates a replay buffer as a crucial component, enabling the reuse of previously acquired experiential data. This inclusion facilitates the practice of experience replay, strategically employed to address the challenge of reward sparsity encountered during the training process. We further integrate the A* shortest path constraint into the loss function with the aim of minimizing the path cost of the robotic arm.
To substantiate the implementation of PPDDPG, comprehensive simulations were conducted using the gym and pybullet environments. The experiments serve to demonstrate the effectiveness and advantages of our algorithm for motion planning in dual-arm robots.
Although our proposed algorithm has achieved promising results, it must be recognized that it has some limitations and needs further improvement. The tasks addressed in this study are relatively basic, so further research in more complex manipulator planning scenarios is necessary. We believe that collaborative planning among multiple manipulators in environments full of dynamic obstacles is an important direction for our future work.