1. Introduction
In an era marked by rapid advancements in industrial automation and artificial intelligence, robotic technologies are increasingly finding applications across a spectrum of fields, with a notable presence in the textile industry [1,2,3,4]. In this context, the automated operation and meticulous path planning of robotic arms have become crucial in elevating both the efficiency of production processes and the quality of end products. This development has spurred extensive research by scholars globally into intelligent control algorithms for robotic arms [5,6,7,8], leading to substantial advancements. Significantly, the utilization of deep reinforcement learning (DRL) algorithms for the control of robotic arms has gained prominence, positioning it as a key research focus within this evolving landscape.
Current research in DRL predominantly focuses on methods based on value functions, such as the deep Q-network (DQN) and its variants, as well as on methods founded on policy gradients, like the policy gradient (PG) algorithm. However, these methodologies often encounter challenges when applied to continuous action spaces, particularly in terms of learning efficiency and path-planning effectiveness in sparse reward environments. For instance, Sangiovanni B et al. employed the DQN-NAF algorithm to control an industrial robotic arm model within the V-REP virtual environment [9]. By crafting a well-structured dense reward function, they successfully enabled the robotic arm to perform tasks effectively while avoiding obstacles in the environment. Similarly, Mahmood A R and colleagues utilized the trust region policy optimization (TRPO) algorithm to adeptly control a UR5 robotic arm to reach target points [10]. Their research emphasized the challenges and issues encountered when applying reinforcement learning algorithms to the control of real robotic arms. Wen S and others delved deeper into the application of the deep deterministic policy gradient (DDPG) algorithm for robotic arm motion planning [11]. They not only examined motion planning in environments with and without obstacles but also proposed improvements to the DDPG algorithm, incorporating transfer learning [12] to accelerate the algorithm's convergence rate. In a notable advancement, Xu Jing et al. developed a model-driven DDPG algorithm [13] that replaced explicit reward functions with a fuzzy system (FS), enabling the successful accomplishment of pin-insertion tasks using a six-degree-of-freedom robotic arm. In their study, the states in the DRL algorithm included observations of interactive forces and momentum during task execution, and a corresponding fuzzy reward system (FRS) was designed to achieve continuous control over the action space. This method was validated in both simulation and real-world tests, achieving a 100% success rate.
Advanced algorithms such as proximal policy optimization (PPO) and DDPG have undeniably made strides in improving the efficiency of robotic arm control. Yet, they grapple with prolonged training cycles and slow convergence in real-world applications [14]. Additionally, the utility of these model-based DRL methods is somewhat hampered by their limited adaptability across varied environments and tasks, a critical aspect in diverse industrial scenarios like textile manufacturing, where robotic arms encounter a spectrum of complexities and uncertainties. In contrast, the wide applicability of data-driven reinforcement learning, particularly end-to-end control mechanisms for robotic arms, has garnered significant attention. Nonetheless, the application of reinforcement learning to high-degree-of-freedom industrial robotic arms is not without its challenges. Chief among these is inefficient action selection, which contributes to extended training durations and delayed convergence. This core challenge stems from reliance on uncertain exploration methods that depend on reward-function modeling, and it becomes particularly pronounced in environments with sparse reward structures.
To address the challenges posed by sparse rewards in robotic arm path planning, researchers have focused on enhancing the versatility of reinforcement learning algorithms. For instance, Kalashnikov D and colleagues introduced the QT-Opt algorithm [15], which involved training a neural network to act as a robotic arm controller using data from 580,000 grasping attempts made by seven robotic arms. Yahya A and others proposed the Adaptive Distributed Guided Policy Search (ADGPS) [16], enabling multiple robotic arms to train independently and share experiences, thereby reducing trial-and-error and finding optimal paths more efficiently. Additionally, Iriondo A et al. employed the Twin Delayed Deep Deterministic Policy Gradient (TD3) method [17] to study the task of picking up objects from a table using a mobile manipulator. Ranaweera M and colleagues enhanced training outcomes through domain randomization and the introduction of noise during the reinforcement learning process [18]. These methods share a core principle of incorporating probabilistic approaches to significantly reduce the impact of ineffective actions. However, they still require extensive exploration time and can result in unproductive actions.
In pursuit of solutions to mitigate ineffective exploration in reinforcement learning, a cohort of researchers has turned to transfer learning paradigms. Notably, Finn C et al. formulated the guided cost learning (GCL) algorithm [19], predicated on the MaxEnt IOC framework, to enforce trajectory constraints on robotic arms with an innovative twist: utilizing human-demonstrated paths as optimal guides for training. Expanding this concept, Ho J and team introduced the Generative Adversarial Imitation Learning (GAIL) algorithm [20], adeptly selecting trajectories closely mirroring human demonstrations, thereby curtailing inefficient maneuvers and accelerating the training process. In a similar vein, Sun Y et al. melded DQN with behavioral cloning to develop the D3QN algorithm [21], markedly diminishing exploration randomness in the initial training phases. Furthermore, Peng X B et al. devised the DeepMimic approach [22], segmenting the reward function into an aggregate of imitation-based exponential components, thereby refining the reinforcement learning process. Lastly, Escontrela A and collaborators unveiled the AMP algorithm [23], which dissects the composite reward function into separate components of imitation and objective, consequently boosting the practicality of action generation. These methodologies shine in utilizing viable trajectory-optimization solutions as constraints in reinforcement learning, although their generalizability tends to lag behind that of the primary class of methodologies in this field.
In an effort to tackle the inherent issues of learning efficiency and the efficacy of path planning in DRL algorithms applied to end-to-end control models, this research innovatively proposes a reward architecture grounded in fuzzy decision making. This framework is meticulously crafted to augment both the efficiency of the learning process and the effectiveness of exploration pathways. Critically, the integration of a cascaded FRS significantly bolsters the precision and resilience of path planning, marking a notable advancement in the domain of DRL.
This research makes significant contributions in the field of robotic control, which are enumerated as follows:
(1) It innovates a multifaceted FRS that intricately considers aspects such as positional accuracy, energy efficiency, and operational safety, thereby enabling a more nuanced representation of a robot’s endpoint dynamics.
(2) It pioneers the application of this FRS in a cascaded format for specific operational tasks, culminating in the development of a groundbreaking cascaded fuzzy reward architecture.
(3) It applies this novel cascaded fuzzy reward system (CFRS) within an end-to-end control paradigm, where its practical effectiveness in facilitating end-to-end planning is rigorously demonstrated.
This manuscript is systematically structured as follows: The second section sets the stage by elucidating the research backdrop, focusing on the intricate details of FRSs and the core tenets of end-to-end DRL. The third section rigorously outlines the architecture and theoretical underpinnings of the CFRS. Section four validates the proposed methodology’s efficacy and scalability through a series of methodical experiments in a simulated setting, highlighting the significant reduction in collision rates to near 5% and showcasing the capabilities of the end-to-end self-supervised learning framework within the realm of model-free DRL. The paper culminates in the fifth section, which synthesizes the research findings and casts a vision for prospective avenues of inquiry in this domain.
3. Cascaded Fuzzy Reward System (CFRS)
One crucial aspect of DRL algorithms is the reward function, which fundamentally shapes the agent's learning strategy and the direction for network optimization. Crafting an ideal explicit reward function to meet long-term goals is a formidable task, chiefly because the mapping relationships from complex state spaces to reward values can be nonlinear, making the manual description of the relationships between reward components highly challenging [33]. Initial research focused predominantly on single-objective optimization, primarily centered on position control [34,35], simplifying the reward function as follows:
$$ r = -\left\| \mathbf{p}_{t} - \mathbf{p}_{tcp} \right\|_{2} = -e_{p} $$ (2)
Here, $\left\| \cdot \right\|_{2}$ represents the Euclidean norm, $\mathbf{p}_{t}$ is the position vector of the target yarn spool, $\mathbf{p}_{tcp}$ denotes the current position vector of the Tool Center Point (TCP), and $e_{p}$ refers to the position error of the TCP. However, solely using the reward function defined in Equation (2) to guide the Critic's assessment of the current strategy proves insufficient [2]. In specific scenarios like path planning for textile robotic arms, multiple factors such as energy consumption and safety need consideration, often entailing conflicts and trade-offs. Therefore, a flexible approach is required to balance these aspects. Researchers have proposed multi-objective optimization methods [36], balancing multiple factors through the linear combination of different objective functions. However, this approach can render the model complex and difficult to interpret, and computational efficiency becomes a concern in large-scale problems. To address this, Xu Jing et al. introduced a method known as the FRS [13]. By integrating prior expert knowledge into the reward system, the FRS can comprehensively evaluate various aspects of robotic assembly tasks. Not only does this system enhance learning efficiency, but it also prevents the agent from getting trapped in local optima. However, this method might face difficulties in handling nonlinear and conflicting objectives.
This section will detail the additional factors considered, the philosophy of the FRS, and its construction process.
3.1. Additional Factors
In general, the cost function for safety should be a non-negative function that decreases as safety increases. It should also be smooth, i.e., its derivatives are continuous throughout its domain, as many optimization algorithms, such as gradient descent, require the function’s derivatives. Based on this analysis, we have defined the following sub-functions for the safety cost:
Cost function for motion range, $C_r$:
where $\theta_i$ represents the angle of the $i$th joint, and $[\theta_{i,\min}, \theta_{i,\max}]$ defines a safe motion range for that angle. This function rapidly increases as the joint angles approach their limits, with $\alpha$ as a parameter adjusting the function's growth rate.
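For illustration, one possible realization consistent with this description is a smooth exponential barrier (the specific form is an assumption, with $C_r$ and $\alpha$ as defined above):
$$ C_{r} = \sum_{i} \left[ e^{\alpha\left(\theta_{i}-\theta_{i,\max}\right)} + e^{\alpha\left(\theta_{i,\min}-\theta_{i}\right)} \right] $$
This expression stays small while each joint remains well inside its safe range and grows rapidly as any joint approaches either limit.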
Cost function for safe distance, $C_d$:
where $d_{\min}$ is the minimum distance between the robotic arm and the nearest person or object in the environment, $d_{\text{safe}}$ is a predefined safe distance, and $\beta$ is a parameter adjusting the function's growth rate. This function is monotonic, increasing as the distance between the robotic arm and other objects or humans in the environment decreases.
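For illustration, one possible smooth, monotonic form consistent with this description is (again, the specific expression is an assumption, with $C_d$, $d_{\min}$, $d_{\text{safe}}$, and $\beta$ as defined above):
$$ C_{d} = e^{\beta\left(d_{\text{safe}} - d_{\min}\right)} $$
This cost remains small while the arm stays well beyond the predefined safe distance and increases rapidly as $d_{\min}$ falls below $d_{\text{safe}}$.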
3.2. Fuzzy System
In this study, we use these defined parameters as input for fuzzy evaluation: the safety distance cost $C_d$, the motion range cost $C_r$, and the positional error $e_p$. We have segmented the fuzzy sets of the FRS into five intervals, {VB, B, M, G, VG}, corresponding, respectively, to Very Bad, Bad, Medium, Good, and Very Good. For an FRS with three inputs, the number of fuzzy rules can be as high as 125. Managing such a large rule set is complex and time-consuming, potentially impacting the algorithm's learning efficiency. Therefore, we adopt a two-layer FS, reducing the rule number for a 3-input FRS to 50.
As shown in Figure 4, the first layer of the two-layer FRS includes an independent FS, taking $C_d$ and $C_r$ as inputs. The output of the first layer, combined with $e_p$, serves as the input for the second-layer FS, allowing further inference. The output of the second-layer FS represents the reward value integrating all three input factors. Within this two-layer FRS, the total number of rules in the system is reduced to 50. The aforementioned parameters are normalized within the range (0, 1) and input into the system. Triangular membership functions (MFs), as per Equation (5), are used for fuzzification, transforming each parameter into five fuzzy values: VG, G, M, B, and VB.
The parameters $a$, $b$, and $c$ in Equation (5) represent values within the triangular MFs, where $a$ and $c$ determine the width of the function, and $b$ determines its position.
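Assuming Equation (5) takes the standard triangular form, which is consistent with the parameter description above, the membership function can be written as
$$ \mu(x; a, b, c) = \max\!\left(\min\!\left(\frac{x-a}{b-a}, \frac{c-x}{c-b}\right), 0\right) $$
which rises linearly from zero at $a$ to one at $b$ and falls back to zero at $c$.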
After parameter fuzzification, fuzzy inference is conducted based on the established rule base. The rule base, as shown in Table 1, is formulated based on the experience of path planning for textile robotic arms. Additionally, the AND operation is used for fuzzy inference.
In this context, $A$ and $B$ represent fuzzy sets. The AND operation computes the minimum of the membership degrees of both sets, i.e., $\mu_{A \cap B}(x) = \min\left(\mu_{A}(x), \mu_{B}(x)\right)$, and the output fuzzy value's membership degree is determined based on the rule base. Since fuzzy inference produces numerous fuzzy values that are not directly usable, a defuzzification method is subsequently employed to obtain crisp values that meet our requirements.
In our study, the centroid method is utilized for defuzzification. Concurrently, we introduce reward weights to balance the importance among different objectives. The weights for the safety distance cost, motion range cost, and positional error are designated as $w_d$, $w_r$, and $w_p$, respectively.
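To make the two-layer evaluation concrete, the following sketch implements a Mamdani-style two-layer fuzzy reward with triangular MFs, min-based AND, and centroid defuzzification. The membership breakpoints and the placeholder rule tables are illustrative assumptions; in the actual system, the rule bases encode the expert knowledge summarized in Table 1.

```python
import numpy as np

LEVELS = ["VB", "B", "M", "G", "VG"]

# Triangular MF parameters (a, b, c) for five evenly spaced sets on (0, 1).
MF_PARAMS = {
    "VB": (0.00, 0.00, 0.25),
    "B":  (0.00, 0.25, 0.50),
    "M":  (0.25, 0.50, 0.75),
    "G":  (0.50, 0.75, 1.00),
    "VG": (0.75, 1.00, 1.00),
}

def tri_mf(x, a, b, c):
    """Standard triangular membership function on a normalized universe."""
    left = (x - a) / (b - a) if b > a else 1.0
    right = (c - x) / (c - b) if c > b else 1.0
    return max(min(left, right), 0.0)

def fuzzify(x):
    """Membership degree of a normalized input in each of the five fuzzy sets."""
    return {lvl: tri_mf(x, *MF_PARAMS[lvl]) for lvl in LEVELS}

def infer(rule_table, in1, in2):
    """Mamdani inference: AND = min, rule outputs aggregated with max."""
    out = {lvl: 0.0 for lvl in LEVELS}
    m1, m2 = fuzzify(in1), fuzzify(in2)
    for l1 in LEVELS:
        for l2 in LEVELS:
            firing = min(m1[l1], m2[l2])           # AND operation
            consequent = rule_table[(l1, l2)]      # output level from the rule base
            out[consequent] = max(out[consequent], firing)
    return out

def defuzzify(out):
    """Centroid defuzzification over a sampled universe of discourse."""
    xs = np.linspace(0.0, 1.0, 201)
    agg = np.array([max(min(tri_mf(x, *MF_PARAMS[lvl]), out[lvl]) for lvl in LEVELS)
                    for x in xs])
    return float(np.sum(xs * agg) / (np.sum(agg) + 1e-9))

def pessimistic_rules():
    # Placeholder rule base: the consequent is the worse of the two antecedents.
    return {(l1, l2): LEVELS[min(LEVELS.index(l1), LEVELS.index(l2))]
            for l1 in LEVELS for l2 in LEVELS}

LAYER1_RULES = pessimistic_rules()   # combines C_d and C_r (25 rules)
LAYER2_RULES = pessimistic_rules()   # combines the layer-1 output with e_p (25 rules)

def fuzzy_reward(c_d, c_r, e_p):
    """Two-layer FRS: 2 x 25 rules instead of the 5**3 = 125 of a flat FRS."""
    layer1_out = defuzzify(infer(LAYER1_RULES, c_d, c_r))
    return defuzzify(infer(LAYER2_RULES, layer1_out, e_p))

# Example: normalized inputs close to "good" yield a high crisp reward.
print(fuzzy_reward(c_d=0.9, c_r=0.8, e_p=0.7))
```

The two 25-rule tables replace the single 125-rule table that a flat three-input FRS would otherwise require.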
The selection of $w_d$, $w_r$, and $w_p$ typically depends on the specific task requirements and environmental conditions. For instance, in a variable industrial environment where a robotic arm might need to respond to sudden situations, such as emergency obstacle avoidance in a task site, safety may take precedence over motion time, thereby prioritizing $w_d$ over $w_p$. Conversely, if motion time is more critical than energy consumption, then $w_p$ would be favored over $w_r$. In an open warehouse environment where the robotic arm is responsible for moving heavy objects, the efficiency of the motion range becomes more significant, thus necessitating an increased weight for $w_r$ to optimize path-planning efficiency. Meanwhile, due to less stringent safety requirements in such open spaces, the weight for $w_d$ is comparatively lower. In a high-risk textile-workshop environment, where the robotic arm operates in tight spaces, safety is paramount. Therefore, $w_d$ is assigned the highest weight to ensure the robotic arm maintains a safe distance and prevents collisions with surrounding objects. In contrast, the weights for $w_r$ and $w_p$ are relatively low in this scenario.
Here, $s$ represents the input state sequence, $R(s)$ is the output after defuzzification, $\mu_i$ is the triangular membership function, and $w_i$ is the weight of the $i$th fuzzy rule's output, with the reward weights being determined using objectives and experience.
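Assuming the defuzzified output takes a standard weighted-centroid form consistent with the variables listed above (the exact expression here is an illustrative assumption), it can be written as
$$ R(s) = \frac{\sum_{i} w_{i}\,\mu_{i}(s)\,c_{i}}{\sum_{i} w_{i}\,\mu_{i}(s)} $$
where $c_{i}$ denotes the centroid of the output fuzzy set of the $i$th rule.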
In summary, the FRS avoids reliance on precise explicit reward functions while meeting the flexible control needs of the task. With simple adjustments, the FRS can also be adapted to various other application scenarios. Next, we introduce a cascaded structure into the FRS, thereby optimizing overall efficiency.
3.3. Cascaded Structure
In the path planning of intelligent robotic arms, particularly within the complex milieu of the textile industry, traditional integrated path-planning methods may encounter challenges such as high computational complexity, poor real-time performance, and weak adaptability to different task stages [37,38]. Therefore, this study introduces a novel CFRS, segmenting the entire path-planning process into distinct phases: initiation, mid-course obstacle avoidance, and alignment for placement. Each phase possesses its specific optimization goals and constraints, equipped with a dedicated fuzzy reward rule base.
In the initial phase's fuzzy reward rule base, the priority is primarily on the robotic arm's motion smoothness and safety. A key rule states, "If the robotic arm's speed is low and it is far from obstacles, then the reward is high". The specific rules are illustrated in Table 2, with the fuzzy logic system's output depicted in Figure 5. Such rules help ensure the robotic arm avoids sudden movements or collisions with objects in the environment during the initial stage.
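As a small illustration, a few such rules could be encoded as a lookup table; the linguistic levels and entries below are hypothetical and only stand in for the full rule base of Table 2.

```python
# Hypothetical excerpt of an initial-phase rule base (the complete set of
# expert rules corresponds to Table 2).
INITIAL_PHASE_RULES = {
    # (arm speed, distance to nearest obstacle) -> reward level
    ("LOW",  "FAR"):  "VG",   # slow and far from obstacles: high reward
    ("LOW",  "NEAR"): "M",
    ("HIGH", "FAR"):  "M",
    ("HIGH", "NEAR"): "VB",   # fast and close to obstacles: heavily penalized
}
```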
During the mid-course obstacle-avoidance phase, the focus of the fuzzy reward rule base shifts to safety and energy efficiency. The rules may become more complex, considering multiple sensor inputs and the dynamic state of the robotic arm. A principal rule is, "If the robotic arm maintains a safe distance from the nearest obstacle and consumes lower energy, then the reward is high". The detailed rules are presented in Table 3, with the fuzzy logic system's output shown in Figure 6.
In the alignment and placement phase, the fuzzy reward rule base emphasizes precision and stability. The rules in this stage are highly refined to ensure accurate alignment and secure placement of the target item. A leading rule is, "If the end-effector's positional error is within an acceptable range and stability indicators meet the preset threshold, then the reward is high". These rules are detailed in Table 4, with the fuzzy logic system's output shown in Figure 7.
To ensure coherence and efficiency in the path-planning process, the CFRS considers smooth transitions between different phases. This transition mechanism is based on the robotic arm's current position, the target position, and the safety distance $d_{\min}$. Specifically, when the robotic arm approaches the goal of the current phase or the current safety distance becomes too short, it switches to the fuzzy logic of the initial phase or the placement phase.
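A minimal sketch of such a transition mechanism is given below; the thresholds, phase names, and the fallback behavior when the safety margin is violated are assumptions chosen for illustration rather than the exact logic used in our system.

```python
import numpy as np

PHASES = ("INITIATION", "OBSTACLE_AVOIDANCE", "ALIGNMENT_PLACEMENT")

def select_phase(current_phase, tcp_pos, phase_goal_pos, d_min,
                 goal_threshold=0.05, d_safe=0.15):
    """Return the phase whose fuzzy rule base is used at the next control step.

    tcp_pos, phase_goal_pos : 3-D positions in metres
    d_min                   : current minimum distance to obstacles or people
    """
    dist_to_goal = np.linalg.norm(np.asarray(phase_goal_pos) - np.asarray(tcp_pos))
    if d_min < d_safe:
        # Safety margin violated: fall back to the obstacle-avoidance rule base.
        return "OBSTACLE_AVOIDANCE"
    if dist_to_goal < goal_threshold:
        # Goal of the current phase reached: advance to the next phase.
        idx = PHASES.index(current_phase)
        return PHASES[min(idx + 1, len(PHASES) - 1)]
    return current_phase
```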
4. Simulation Environment and Tasks
In this study, we constructed a high-fidelity physical simulation system, as illustrated in Figure 8. To the right of the robotic arm is a desktop, upon which several obstacles and target yarn spools are placed. The spools have a diameter and height of 20 cm, and the rectangular obstacles measure 32 × 22 × 12 cm. To the left is a spacious platform, serving as the preparation area for the yarn-spinning machine and indicating the target placement location for the spools. Additionally, a horizontal beam is situated above, which the robotic arm should avoid colliding with during operation. Moreover, a humanoid model is placed nearby as one of the obstacles in the task environment.
During the initial phase of the transfer task, the robotic arm is positioned at point A or its vicinity. Additionally, a red yarn spool, the object to be grasped and placed, is located beneath point B on the table.
The process of grasping and placing the yarn spool involves a series of precise actions. Firstly, the robotic arm’s end effector must vertically descend, insert its claws into the central hollow of the spool, expand them, and then lift the spool using frictional force, eventually placing it at point C on the left platform with appropriate posture. To increase the generalizability of the problem, only one target spool is placed in the virtual twin platform (directly below Region 1), but its position may vary in different simulations to accommodate diverse possibilities.
The movement from point A to B involves the end effector of the robotic arm moving a short distance along the -x axis, then along the y-axis, followed by continued movement along the -x axis, approaching the yarn spool with appropriate posture, and finally lifting the spool along the z-axis. If the robotic arm were to move directly from point A to B, it would inevitably collide with the beam in the environment.
Therefore, the entire grasping-placing task is divided into a continuous trajectory consisting of the following segments:
(1) The yarn spool appears at any location on the table, and the TCP of the robotic arm moves from the initial position A along the trajectory of segment 1 to the preparatory position B with appropriate posture. Point B is located above the center of the spool along the z-axis, potentially in any position within Region 1.
(2) The robotic arm moves along trajectory segment 2 from the preparatory position B to the placement position C (which can be randomly designated within Region 2), contracts its claw, and places the spool on the platform.
(3) The arm resets and moves back to the vicinity of the initial position A.
This simulation platform enables large-scale strategy training, yielding a rich and high-quality training dataset. These data can be directly applied to the trajectory planning and generation of real-world robots. More importantly, this simulation system not only aids in policy transfer and the implementation of safety constraints but also considers diverse production scenarios and environmental variables in simulations. This feature allows for providing more comprehensive and precise training data for real-world robot operations, thereby validating the robustness and reliability of robots under various environmental conditions. Additionally, the system allows for preliminary testing and optimization of safety and stability in a safe, controlled virtual environment.
5. Experiments
According to the specifications described in Section 4, we constructed a comprehensive simulation experiment environment. This environment utilizes RGB and depth images captured using three Kinect cameras as the state inputs for the network model, and joint space variables as the output control commands for the network model. The training termination criterion of the network model is set such that the distance between the target and the end effector is less than 10 mm, and the maximum Euler angle of the target relative to the end effector is less than 3 degrees. Throughout the simulation cycle, targets are randomly set within the operational space of the robotic arm. Simultaneously, the model undergoes training of the deep neural network based on feedback from the Kinect cameras and executes grasping tasks with various objects.
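The termination criterion can be expressed as a simple predicate; the function below is an illustrative sketch, with names and the representation of the relative orientation chosen for demonstration rather than taken from the actual implementation.

```python
import numpy as np

def reached_target(tcp_pos, target_pos, rel_euler_deg,
                   pos_tol_mm=10.0, ang_tol_deg=3.0):
    """Episode-termination check: 10 mm position and 3 degree orientation tolerance.

    tcp_pos, target_pos : 3-D positions in metres
    rel_euler_deg       : Euler angles (degrees) of the target relative to the
                          end-effector frame
    """
    pos_err_mm = 1000.0 * np.linalg.norm(np.asarray(target_pos) - np.asarray(tcp_pos))
    max_ang_err = np.max(np.abs(rel_euler_deg))
    return pos_err_mm < pos_tol_mm and max_ang_err < ang_tol_deg
```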
Our experiments addressed the following questions: (1) How does the end-to-end DRL model for planning compare with other manually programmed DRL model baselines? (2) Can the end-to-end-planning DRL model learn multiple viable planning behaviors for unseen test scenarios? (3) Can the CFRS, compared to a single-rule FRS, further enhance performance?
5.1. Baseline Comparison with End-to-End DDPG
This subsection will present and discuss the results of the comparison experiment between end-to-end DDPG and baseline DDPG.
In the experiment, baseline DDPG [39] was trained on trajectories 1, 2, and 3, described in Section 4, and named BS1, BS2, and BS3, respectively. These three segments were sequentially concatenated to form BS0, with its experimental results determined solely using the data from BS1, BS2, and BS3, without independent experiments. Subsequently, the end-to-end DDPG, described in Section 2, was used to train on trajectories 1, 2, and 3, named ETE1, ETE2, and ETE3, respectively. Finally, end-to-end DDPG was used for comprehensive training on trajectories 1, 2, and 3, named ETE0. The parameters for the aforementioned DDPG are provided in Table 5.
In our study, we conducted a quantitative analysis of success rates and collision rates. Specifically, the two algorithm models, each trained through 400 k iterations, were applied to different trajectories within the same scenario. During the experimental process, we conducted 2000 experiments in a simulated environment using the reward function defined in Equation (2). The success rate was defined as the percentage of all experimental cycles in which the robotic arm reached the target point within the episode, i.e., with an error distance of less than 10 mm. The collision rate was defined as the percentage of experimental cycles in which the robotic arm collided with the surrounding environment.
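For clarity, the two metrics can be computed directly from the episode records, as in the following sketch; the record structure assumed here is illustrative.

```python
def evaluate(episodes):
    """episodes: list of dicts with keys 'final_error_mm' and 'collided'."""
    n = len(episodes)
    successes = sum(1 for ep in episodes if ep["final_error_mm"] < 10.0)
    collisions = sum(1 for ep in episodes if ep["collided"])
    return 100.0 * successes / n, 100.0 * collisions / n

# Example usage with 2000 simulated episode records:
# success_rate, collision_rate = evaluate(episode_log)
```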
As shown in Figure 9, within the same algorithm, trajectory 2 required less time and exhibited a higher success rate and a lower collision rate than the other two trajectories, indicating its relative simplicity. This finding is corroborated by the description in Section 4. Notably, the time consumption for BS0 in the table is the sum of the time taken for the three baseline segments, with its success rate and collision rate being the average of these three baseline performances. In test scenarios, our end-to-end DRL model achieved a success rate of 90.1% and a collision rate of only 5.7%. Despite a slight increase in training time, the end-to-end DDPG showed significant advantages in terms of task success rate and collision reduction compared to the baseline model. This can be attributed to the deep network's precise mapping of state–action relationships and efficient execution of more complex tasks. Next, we comprehensively assess the adaptability of the algorithm model to environmental changes in complex environments.
5.2. Generalization Ability
In DRL, a model's generalization ability is a critical evaluation metric [40]. To empirically explore the adaptability of our model in different test environments, we constructed a series of test scenarios with varying complexity (e.g., number and distribution of obstacles) and conducted quantitative tests on the strategy success rates of both algorithms under different obstacle conditions. The results are shown in Table 6.
It is evident that the performance of strategies is negatively impacted by increased occlusion and complexity in obstacle-avoidance tasks. Notably, these high-complexity scenarios presented greater challenges for traditional baseline methods, mainly because these methods rely on manually extracted, localized state inputs. Such limited information is insufficient to accurately represent the robotic arm’s motion dynamics and potential conflicts in complex environments. This limitation further highlights the superiority and robustness of our proposed end-to-end DRL model in complex scenarios.
Relatively speaking, our model, capable of comprehensively analyzing and extracting features from RGB and depth images, offers a higher-dimensional state-space representation. This enriched state representation allows the model to identify effective trajectory-planning and execution strategies in environments with more obstructions and obstacles. Next, we introduce the CFRS, which brings significant improvements in the model's convergence and robustness during training.
5.3. Cascade Fuzzy Reward System
To address the challenges of sparse rewards and signal delays in DRL, this study introduces a CFRS based on fuzzy logic. We conducted ablation experiments comparing the CFRS with both a unique reward system (URS) and a unique fuzzy reward system (UFRS).
Data from Figure 10 reveal that models employing the URS converge more slowly, achieving convergence around 250 k episodes, indicating limitations in global optimization. In contrast, UFRS models converge within approximately 120 k episodes, demonstrating faster optimization. Most notably, CFRS models converge within just 100 k episodes, illustrating the effectiveness of cascaded fuzzy rewards in efficient signal propagation and rapid global optimization.
Figure 11 focuses on the trends in collision rates under different reward systems. The URS model initially experiences a rapid decrease in collision rates but later stabilizes around 20%, possibly due to settling at local optima induced by its simplistic reward mechanism. UFRS shows a significant initial decrease in collision rates, with fluctuations within the 15–20% range, indicating some robustness in dynamic environments. CFRS, on the other hand, exhibits a continual decrease in collision rates, eventually stabilizing at a low level of around 5%, further proving its robustness and efficiency in practical operations.
In summary, the introduction of the CFRS significantly enhances the model in terms of convergence speed, stability, and reduced collision rates. This design not only promotes rapid global optimization of the algorithm but also enhances the model’s robustness and reliability in complex environments. Therefore, the CFRS provides an efficient and robust reward mechanism for DRL in complex tasks such as robotic arm path planning.
5.4. Real World Experiment
In this section, we present a series of experiments conducted in real-world environments to validate the practical performance of our model. The objective of these experiments was to assess the model’s capability in handling path planning and task execution in actual physical environments.
The experiments involved maneuvering a robotic arm from various starting points to designated target locations, as illustrated in Figure 12. These images demonstrate the model's performance in navigating narrow spaces and executing complex path planning. Key performance indicators, including the task success rate, path-planning precision, and task completion time, were recorded and presented in graphical form to validate the efficacy of the model. Comparative analysis indicates that our model outperforms traditional methods in handling specific challenges, highlighting its potential in real-world application scenarios.
In summary, these experimental results validate the practical performance of our model, providing a foundation for future applications in similar environments.