1. Introduction
As society advances, the frequent disposal of both premium and everyday products poses a significant environmental threat, highlighting the urgency of incorporating sustainable practices into product lifecycle management. The direct disposal of such products not only contributes significantly to environmental pollution but also depletes finite resources, exacerbating environmental issues that disrupt people’s daily lives. Dismantling these products becomes imperative, with a focus on extracting valuable end-of-life (EOL) components that contribute to mitigating environmental pollution [1]. However, to truly address the environmental impact of product disposal, a holistic approach rooted in sustainable principles is essential.
Embracing sustainable principles in product disassembly involves more than just extracting valuable components. It requires adopting strategies that prioritize resource efficiency, waste reduction, and environmental conservation throughout the entire product lifecycle. By optimizing disassembly processes and promoting the reuse of EOL components, not only can environmental pollution be minimized, but valuable resources can also be conserved [2]. This shift toward sustainable disassembly practices aligns with the overarching goal of sustainable resource management and environmental conservation, reflecting a commitment to preserving ecosystems and promoting long-term societal well-being.
Incorporating sustainable considerations into the disassembly line balancing problem is essential for achieving lasting environmental benefits. By optimizing disassembly processes to maximize resource recovery and minimize waste generation, manufacturers can contribute to a circular economy model where resources are used more efficiently and environmental impacts are minimized. Moreover, by extending the lifespan of products through effective disassembly and component reuse, the need for raw material extraction and the production of new goods can be reduced, further advancing sustainable practices and mitigating environmental degradation [3].
In the contemporary era, escalating concerns about pollution have prompted the adoption of effective measures to alleviate its impact, aligning with the principles of sustainability. One such approach involves the disassembly and reprocessing of recyclable products, which not only mitigates pollution but also conserves valuable resources and promotes environmental sustainability. This necessitates addressing the disassembly line balancing problem, which exhibits variations across diverse real-world scenarios. Various metrics, including revenue maximization, minimization of the disassembly time, and optimization of the workstation smoothness index, among others, come into play in sustainable disassembly practices. Formulating a tailored mathematical model based on specific situations allows for a targeted resolution of the associated disassembly line balancing problem, integrating sustainability considerations into decision-making processes.
An array of disassembly methods exists, encompassing complete and partial disassembly, along with both destructive and non-destructive approaches [4]. Different disassembly lines, such as linear, bilateral, parallel, and U-shaped configurations, offer distinct approaches, each with its own implications for sustainability. Designing specific disassembly lines for different facilities facilitates the examination of balancing problems while ensuring that sustainable practices are upheld throughout the process. Currently, there is widespread exploration of mixed lines across various sectors of society, reflecting the growing recognition of the importance of sustainability in industrial practices. Given the diversity of recyclable products and the imperative to enhance disassembly and recycling practices, the implementation of mixed disassembly line methods, integrating linear and U-shaped configurations, has gained traction. This endeavor aims to better serve society, purify the environment, and contribute to the establishment of a green ecosystem, aligning with the overarching goals of sustainable development and promoting a circular economy model.
In this paper, our focus is directed toward tackling the two-sided disassembly line balancing problem (DLBP) [5,6]. The conventional layout of a disassembly line is depicted in Figure 1. The literature addressing two-sided disassembly lines, particularly in the realm of human–robot collaboration, remains sparse. Furthermore, there is a scarcity of research applying reinforcement learning to tackle this specific problem. Previous work by Xie et al. [7] considered the two-sided DLBP with constraints on workstations and energy consumption, utilizing a differential evolutionary algorithm for solving the multi-objective two-sided DLBP. However, this method is better suited to continuous problems, while the DLBP is inherently discrete, so a transformation into a discrete algorithm is necessary for effective problem resolution. Wang et al. [8] explored modeling based on two-sided disassembly lines and proposed a solution using a genetic algorithm with variable neighborhood search, enhancing the algorithm’s convergence speed. Another work by Wang et al. [9] delved into DLBP modeling and solutions, investigating cases involving uncertain disassembly times, unknown cycle times, two-sided pattern layouts, and the presence of disassembly obstructions between parts; they employed a hybrid artificial bee colony algorithm along with a variable neighborhood-depth search strategy. Zhao et al. [10] studied the parallel DLBP considering carbon emissions, establishing a hybrid graph-based disassembly model, and used a genetic algorithm to solve the problem. However, these studies do not account for human–robot collaboration, focus primarily on multiple objectives that cannot be addressed with the CPLEX solver, and rely on heuristic algorithms. Therefore, this paper formulates a model that integrates human–robot collaboration and solves it with a currently popular reinforcement learning technique.
Many existing articles resort to intelligent algorithms, such as the artificial bee colony algorithm, elite differential evolution algorithm, and others, to address the disassembly line balancing problem (DLBP) [11]. Although these algorithms can yield optimal solutions, they frequently require numerous iterations, potentially leading to instability in their final results. In this paper, we employ a reinforcement learning algorithm from machine learning to address disassembly line balancing across small, medium, and large scales. Reinforcement learning, currently recognized as a more efficient artificial intelligence method [12], emerges as the preferred choice for resolving the DLBP in selective disassembly, ensuring quicker attainment of optimal solutions and thereby enhancing efficiency. Given the escalating societal attention to the DLBP, a plethora of algorithms is being applied in this domain, with a growing emphasis on the efficiency of delivering optimal solutions. Reinforcement learning has become widely utilized in path planning problems [13,14]. Simultaneously, the integration of reinforcement learning and disassembly lines has garnered growing interest from academia and industry alike [15,16]. In this paper, we utilize the Q-learning algorithm to address the DLBP and validate the algorithmic results by solving the corresponding mathematical model with CPLEX. The outcomes demonstrate that the Q-learning algorithm exhibits a rapid solution speed and yields optimal solutions.
The contributions of this paper are primarily encapsulated in the following aspects:
(1) A corresponding mixed-integer programming model is established. Existing solutions to the disassembly line balancing problem often involve multi-product and multi-objective formulations, posing challenges for solvers like CPLEX. This paper adopts a single-objective linear mathematical model, facilitating a solution with CPLEX, thus ensuring accurate verification and mitigating potential errors.
(2) A reinforcement learning approach is introduced to solve the DLBP, a method less commonly utilized in current practices. A novel Q-table is defined, enabling the algorithm to learn the environment based on the objective function.
(3) The model’s accuracy is initially validated with CPLEX, followed by a comparison of the algorithmic results with those obtained from CPLEX. This two-step validation process ensures the reliability of the proposed algorithm and guards against the possibility of failing to identify the optimal solution.
The subsequent sections of this paper are structured to provide a comprehensive understanding of our research. In Section 2, we offer a detailed description of the problem, accompanied by the formulation of the mixed-integer programming model, and provide relevant illustrations for a detailed explanation of the process. Moving on to Section 3, we delve into the principles of the Q-learning algorithm and elucidate its algorithmic procedure; this section particularly emphasizes the application of the Q-learning algorithm to effectively address the disassembly line balancing problem (DLBP). In Section 4, we thoroughly analyze the results obtained using our algorithm and compare them with the outcomes derived using the CPLEX solver; this comparative analysis aims to showcase the performance of our algorithm. Finally, Section 5 concludes the paper, summarizing the key findings and contributions presented in the preceding sections.
3. Q-Learning Algorithm and Application
3.1. Q-Learning Algorithm Description
This subsection introduces the principles and processes of the Q-learning algorithm [19,20,21]. The Q-learning algorithm is a widely used method in reinforcement learning, with a common example being its application to maze exploration to locate a treasure [22,23]. In the realm of reinforcement learning, the intelligent agent continuously interacts with an unknown environment. Two fundamental concepts, namely state and action, are introduced. The agent takes actions with corresponding values to continuously maximize rewards. The Q-learning algorithm is employed to store information learned about the environment using a Q-table. In each iteration, the action with the highest Q-value in the current state is chosen for the state transition. The agent takes an action a based on the current state s, and an action value is associated with each corresponding action [24]. Different return values are obtained based on the state transition and the action taken by the agent. These return values are stored in the Q-table after comprehensive environment learning, allowing the agent to select the appropriate action for state transitions.
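For concreteness, the Q-table described above can be pictured as a simple two-dimensional array with one row per state and one column per action. The minimal sketch below (with arbitrary, purely illustrative sizes and values, not data from this paper) shows this representation and the greedy lookup of the best action in a state.

```python
import numpy as np

# Minimal illustration of a Q-table: one row per state, one column per action.
n_states, n_actions = 4, 4
Q = np.zeros((n_states, n_actions))

def greedy_action(Q, state):
    """Return the action with the highest Q-value in the given state."""
    return int(np.argmax(Q[state]))

Q[0, 2] = 1.5               # pretend the agent has learned something about state 0
print(greedy_action(Q, 0))  # -> 2
```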
In reinforcement learning, the fundamental concept is to enable the agent to interact with the environment to gather information. When the agent reaches a particular state, the information within that state is recorded, which facilitates a better selection of states and actions in the broader environment later on. Essentially, reinforcement learning is a continuous process of exploration and trial and error. As the agent accumulates enough experience, denoted as E, the Q-learning algorithm translates this experience into the constant updating of values in the Q-table [25,26].
The core idea of the Q-learning algorithm is the following update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$
The essence of this formula lies in the Bellman equation, which represents the transformation relationship of the value–action pair. The term $r + \gamma \max_{a'} Q(s',a')$ is referred to as the Temporal-Difference (TD) target, and $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$ denotes the TD deviation, commonly known as the TD error. Here, $\alpha$ represents the learning rate, and $\gamma$ is the reward decay coefficient. Because the formula is updated using the TD method, Q-learning falls under the category of TD algorithms.
The aforementioned equation constitutes the Q-learning update formula. It relies on the next state: the largest Q-value over the actions available in that state is selected, multiplied by the decay $\gamma$, and added to the actual return value, which together form the most realistic Q-value (the TD target). The update is based on the $Q(s,a)$ from the previous Q-table, which serves as the current Q-value estimate. Through this formula, we continuously extract information from the environment, updating the Q-values in the corresponding positions of the Q-table based on the acquired information, whether positive or negative. Applying this algorithm also involves the state value function of the Bellman equation, which we denote as follows:
$$V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right]$$
The Bellman equation for this state function is expressed as
$$V_{\pi}(s) = \sum_{a} \pi(a\mid s) \sum_{s'} P(s'\mid s,a)\left[r(s,a,s') + \gamma V_{\pi}(s')\right]$$
The state value function represents the value of the current state, taking into account not only its own state value but also the cumulative sum of values from all states that can be reached in subsequent steps. At each step, the contribution is multiplied by the decay factor $\gamma$, which signifies the proportion of the later state values. A larger $\gamma$ implies a greater influence of later state values on the Q-values in our Q-table, emphasizing their significance. This allows us to derive the optimal state value function, denoted as $V^{*}(s)$, based on the state value function.
Applying this definition of the state function to our disassembly line context involves subtracting the disassembly time from 100 for each position. This transformation ensures that smaller time values correspond to larger values inside the Q-table under the learning rule above. Ultimately, the optimal disassembly sequence is obtained based on the greedy strategy.
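As a small numerical illustration of a single update under this transformation (the initial Q(s, a) of 0, the task time of 8, and the next-state maximum Q-value of 50 are assumed purely for illustration; α = 0.5 and γ = 0.9 match the settings used later in this paper):
$$Q(s,a) \leftarrow 0 + 0.5\big[(100-8) + 0.9 \times 50 - 0\big] = 0.5 \times 137 = 68.5 .$$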
The pseudo-code for the Q-learning algorithm is shown in Algorithm 1.
Algorithm 1: Q-learning Algorithm
Input: Q(s, a) for all s ∈ S, a ∈ A, initialized arbitrarily, with Q(terminal state, ·) = 0; parameters α, γ
Output: a new Q(s, a)
Repeat (for each episode):
  Initialize S
  Repeat (for each step of the episode):
    Choose A from S using a policy derived from Q (e.g., ε-greedy)
    Take action A, observe R, S′
    Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
    S ← S′
  Until S is terminal
The algorithm iterates the model with the goal of selecting the action with the maximum value for the next state.
The action taken in each state is selected using the ε-greedy strategy, while the update target uses the maximum action value of the next state; because the policy being learned differs from the policy generating the behavior, Q-learning belongs to the category of off-policy learning algorithms. By choosing the maximum action value as the target value, the direction of the value update is directed toward the optimal action value $Q^{*}$.
This approach relies on the value function: iterations are carried out through the value function, and by following the optimization strategy described above, we obtain the action corresponding to the optimal value.
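As a concrete reference, the following is a minimal, generic tabular implementation of Algorithm 1 in Python. It is a sketch rather than the exact code used in this paper, and the environment interface (`reset()` and `step()` returning the next state, reward, and a termination flag) is an assumption introduced here for illustration.

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=2000,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning following Algorithm 1 (generic sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                          # Initialize S
        done = False
        while not done:                          # for each step of the episode
            # epsilon-greedy action selection derived from Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # take action A, observe R and S'
            # TD update toward the target R + gamma * max_a Q(S', a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                           # S <- S'
    return Q
```

Any environment object exposing these two methods can be plugged in; in the disassembly setting described next, the state is the current task and the reward is derived from the task times.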
3.2. Q-Learning Combined with Disassembly Lines
From the aforementioned explanation, it is evident that the algorithm is well suited for solving discrete problems, and the disassembly line balancing problem (DLBP) is inherently a discrete problem. Therefore, the algorithm is indeed applicable to solve the DLBP. This paper specifically aims to minimize the total disassembly time. The mathematical transformation of small time values into relatively large values allows the algorithm to effectively select the state with a smaller time at each step, aligning with the objective function. To integrate the task AND/OR graph into reinforcement learning, we treat the set of selectable actions as the next set of states that the current state can reach. Subsequently, the algorithm proceeds to solve the disassembly line problem, and the steps are described in detail below based on the aforementioned algorithm flowchart.
Step 1:
Initialize the Q-table by constructing a two-dimensional array with n columns and m rows, where the columns represent the actions and the rows represent the states. The value of each position in this two-dimensional array is initially set to 0. According to the DLBP and the AND/OR graph, each state corresponds to a disassembly task and each action to the next task to be performed, so the Q-table is a square array.
Step 2:
Select an action in state S according to the Q-table. At the very beginning, every Q-value is set to 0, and the values of the states reachable from the current state are set to the corresponding reward values according to the specifics of the DLBP. Initially, the agent knows nothing about the environment, so it explores and chooses actions randomly. As the agent continues to explore the environment, the probability of making a random choice decreases, i.e., the agent gradually learns about the environment.
Step 3:
Execute an action. In the disassembly line balancing problem, this starts with all states being reachable, so a random task is chosen to perform. In the process of continuous exploration, the agent’s estimation of the Q-value becomes more and more accurate, i.e., after a certain number of explorations, the algorithm can find the optimal solution.
Step 4:
Evaluate. Now that the action has been taken and the resulting state and reward have been observed, the Q-value is updated according to the update rule. Once the pre-set number of exploration steps has been completed, the agent is ready to exploit the environment: at this point, the Q-table represents the optimal Q-values, the agent selects better actions, and the column index of the largest value in a row of the Q-table gives the optimal action for the corresponding state. According to the model presented in this paper, the algorithm stops when the gain criterion is satisfied, which is checked each time the agent completes a state transition.
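The four steps above can be summarized in the following sketch, which adapts the generic loop to the DLBP. All of the data here (the reward matrix, per-task profits, and the profit target) are small hypothetical values introduced only to make the sketch self-contained; they are not taken from the cases studied in this paper.

```python
import random
import numpy as np

# Hypothetical 4-task example: R[i, j] = 100 - time of task j if j is reachable
# from i in the AND/OR graph, -1 if unreachable, 0 on the diagonal.
R = np.array([[  0, 92, 90, -1],
              [ -1,  0, -1, 88],
              [ -1, -1,  0, 85],
              [ -1, -1, -1,  0]], dtype=float)
profit = np.array([0.0, 4.0, 3.0, 6.0])   # illustrative per-task profits
profit_target = 7.0                        # illustrative gain criterion

def train(R, episodes=2000, alpha=0.5, gamma=0.9, eps_start=1.0, eps_end=0.05):
    n = R.shape[0]
    Q = np.zeros((n, n))                   # Step 1: square Q-table, all zeros
    for ep in range(episodes):
        eps = max(eps_end, eps_start * (1.0 - ep / episodes))  # exploration decays
        s, gained = 0, 0.0
        while gained < profit_target:      # Step 4: stop when the gain criterion holds
            reachable = np.where(R[s] > 0)[0]       # actions allowed by the AND/OR graph
            if len(reachable) == 0:
                break
            if random.random() < eps:               # Step 2: explore early on
                a = int(random.choice(list(reachable)))
            else:                                   # ...then exploit the Q-table
                a = int(reachable[np.argmax(Q[s, reachable])])
            r = R[s, a]                             # Step 3: execute and observe the reward
            Q[s, a] += alpha * (r + gamma * np.max(Q[a]) - Q[s, a])
            gained += profit[a]
            s = a                                   # the chosen task becomes the new state
    return Q
```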
Figure 4 depicts the evolution of Q-values for reachable states in the Q-table corresponding to state 0, using the copier assembly example solved in this paper. In accordance with the approach presented in this paper, the constructed task 0 can reach tasks 1, 2, 16, and 28 through the AND/OR graph. The iterative graph of Q-values reflects this connectivity. By observing the iterative graph, it becomes apparent that stability is almost reached around 2500 training steps, and the final state 1 attains the highest Q-value. Consequently, we opt to begin the disassembly process with task 1.
The following describes how to improve the Q-learning algorithm to solve the DLBP, using the washing machine AND/OR diagram from the previously mentioned reference as an example.
In the original Q-learning algorithm, the reward table is initialized to guide the algorithm on how to reach the destination: the values are set to 1 for reachable states, −1 for unreachable states, and 100 for reaching the destination. We construct an initial table based on this logical relationship and set the total number of learning steps according to the value-iteration principle, ultimately obtaining the Q-table after learning. However, this approach assumes a clear destination, which does not suit our problem, since the algorithm should not terminate only when a specific task is reached; instead, a constraint should be used as the termination criterion. Given that Q-learning selects the action with the maximum value in each state, we can modify the approach for our problem in one of the following two ways:
(1) Each state selects the action with the smallest value in the corresponding row of the Q-table, meaning the action with the smallest time consumed. The initial table stores the time consumed by each task, ensuring that the state transition selects an action with minimal time.
(2) Following the original Q-learning rule of selecting the action with the maximum value in each row of the Q-table, the initial table instead stores 100 minus the time consumed by each task. Selecting the action with the largest value then corresponds, in our model, to the task that actually takes the least time, and this completes the state transition. In this paper, we adopt the second approach.
3.3. Definition of Reward Matrix R
According to the previously mentioned washing machine AND/OR diagram, the case involves 13 tasks [27], corresponding to a total of 13 states in the algorithm. Each state can have 13 actions, meaning that each task can choose from 13 tasks for the state transition. Therefore, we construct a 13 × 13 two-dimensional matrix. If a state cannot be reached, we set the value of the corresponding position to −1; if it is the state itself, we set it to 0; and for states that can be reached, we set the value to 100 minus the disassembly time spent on the corresponding task. From the figure, it is apparent that task 1 can reach tasks 2, 3, and 4. Since array indices start at 0, we set the values of positions [0,1], [0,2], and [0,3] in the array to 100 minus the disassembly times of tasks 2, 3, and 4, respectively. This ensures that the agent can learn effectively. In the end, at each step, the position with the largest value in the Q-table is selected, and the corresponding state is the task number to be performed next. This completes the learning of the example environment. From this arithmetic example, we can summarize the method of defining the reward matrix R = [$r_{ij}$] to be learned as follows:
$$r_{ij} = \begin{cases} -1, & \text{if task } j \text{ cannot be reached from task } i,\\ 0, & \text{if } i = j,\\ 100 - t_j, & \text{if task } j \text{ can be reached from task } i, \end{cases}$$
where $t_j$ denotes the disassembly time of task $j$.
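This definition translates directly into code. The successor lists and task times below are hypothetical placeholders (not the washing machine data from this paper), and the function name is introduced here only for illustration.

```python
import numpy as np

# Hypothetical AND/OR-graph successors and task times (placeholders only).
successors = {0: [1, 2, 3], 1: [4], 2: [4, 5], 3: [5], 4: [], 5: []}
task_time  = {0: 6.0, 1: 8.0, 2: 5.0, 3: 7.0, 4: 4.0, 5: 9.0}

def build_reward_matrix(successors, task_time):
    """r_ij = 100 - t_j if task j is reachable from task i, 0 if i == j, -1 otherwise."""
    n = len(task_time)
    R = -np.ones((n, n))              # default: unreachable
    np.fill_diagonal(R, 0.0)          # a task paired with itself
    for i, succ in successors.items():
        for j in succ:
            R[i, j] = 100.0 - task_time[j]   # shorter disassembly time -> larger reward
    return R

R = build_reward_matrix(successors, task_time)
```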
Based on this definition, the R matrix of the washing machine is obtained directly from its AND/OR graph and the disassembly time of each task.
Based on the matrix obtained from the AND/OR graph, the agent interacts with the environment multiple times, and the environment provides feedback values to the agent. This process results in a Q-table, which stores the Q-values for the actions that can be selected in each state. Using the greedy algorithm, we select the action corresponding to the largest Q-value from the starting state to reach the next state, which represents the next task. This completes the state transition. The algorithm terminates when the gain constraint is satisfied. The optimal solution is then output based on the achieved results.
The algorithm’s learning process based on the initial table is as follows: assuming 2000 learning steps, a learning rate of α = 0.5, and a future-return decay value of γ = 0.9, training produces the learned Q-table, whose values are used below.
In the trained Q-table, the maximum value in the first row is 256.7, corresponding to action 2; therefore, the next step is to select task 2 for disassembly. Moving to the second row, the maximum value is 178.6, so action 5 is selected for the state transition, corresponding to disassembly task 5. Moving to the fifth row, the maximum value is 94, which would correspond to action 11 and task 11; however, once task 5 has been disassembled, the disassembly-profit criterion is satisfied, so the algorithm's stopping criterion is met and the optimal solution is output. The corresponding disassembly sequence is 1 → 2 → 5.
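The greedy read-out of the trained Q-table described in this example can be sketched as follows; the Q-table, reward matrix, profit vector, and profit target are assumed to be available from the earlier steps, and the function name is illustrative.

```python
import numpy as np

def extract_sequence(Q, R, profit, profit_target, start=0):
    """Greedily read a disassembly sequence from a trained Q-table,
    stopping once the accumulated profit satisfies the gain criterion."""
    sequence, gained, s = [], 0.0, start
    while gained < profit_target:
        reachable = np.where(R[s] > 0)[0]               # tasks allowed by the AND/OR graph
        if len(reachable) == 0:
            break                                       # no further disassembly is possible
        a = int(reachable[np.argmax(Q[s, reachable])])  # largest Q-value wins
        sequence.append(a)
        gained += profit[a]
        s = a
    return sequence
```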
4. Case Introduction and Results Analysis
This section introduces six different cases of AND/OR graphs and describes how they are incorporated into the Q-learning algorithm. The goal is to address cases of varying sizes to cater to the specificity of the problem and account for different scenarios. The results of the respective runs are analyzed based on the varying case sizes, and conclusions are drawn [28]. The model was solved using both the Q-learning algorithm and IBM ILOG CPLEX. The Q-learning algorithm was coded in Python and implemented on a computer with an Intel(R) Core(TM) i7-10870H CPU @ 2.30 GHz and the Windows 10 operating system.
For the initial case, a small example featuring 13 tasks related to ballpoint pens is utilized for testing [29]. The AND/OR graph for this small ballpoint pen case is illustrated in Figure 5 below, and this case is denoted as Case 1.
Figure 6 shows the part diagram of the hammer drill, from which its main components can be seen.
In the second case, a ballpoint pen with 20 tasks is considered. Since subassembly 1 can perform both task 1 and task 2, an additional task 0 is introduced into the algorithm. Task 0 can reach tasks 1 and 2. Consequently, there are 21 tasks and 24 subassemblies, leading to the construction of a 21 × 21 two-dimensional matrix. This case is denoted as Case 2 [30,31]. For the third case, a slightly larger copier with 32 tasks is used. According to the AND/OR graph, subassembly 1 can handle tasks 1, 2, 16, or 28. Therefore, task 0 is constructed in the algorithm, allowing it to reach tasks 1, 2, 16, and 28. This results in the construction of a 33 × 33 matrix for the algorithm to learn. This case is referred to as Case 3 [32,33]. In the fourth case, a rotor is considered, featuring 15 tasks and 17 subassemblies. A 15 × 15 two-dimensional matrix is constructed based on the AND/OR graph for the algorithm to learn. This case is referred to as Case 4 [34,35]. The fifth case involves a slightly larger radio with 30 tasks and 29 subassemblies. To enable task 0 to reach tasks 1 and 2, a 31 × 31 two-dimensional matrix is constructed for the algorithm to learn. This case is denoted as Case 5 [36,37]. Finally, a large example of a hammer drill with a partial structure is utilized for the sixth case. This case comprises 46 tasks and 63 subassemblies [38,39]. A 46 × 46 matrix is constructed for the algorithm to learn. This case is referred to as Case 6.
Table 1 presents a detailed description of all the cases, including the various parameters for small, medium, and large products, respectively.
Table 2 describes the algorithm parameter settings for each product.
Table 3 shows the results of the algorithm for each case and the results obtained with the CPLEX solver.
Table 4 presents a comparison of the results of the Q-learning algorithm and the SARSA algorithm for the six cases; the latter part of Table 4 presents the comparison between the improved Q-learning algorithm and the original Q-learning algorithm for the same six cases.
Figure 7 depicts the convergence of the objective function for the Q-learning and SARSA algorithms in Case 6. As the objective of this paper is to minimize the maximum disassembly time while ensuring a certain disassembly profit, smaller values of the objective function are preferable. From the graph, it can be observed that, after stabilization, the objective function value of the Q-learning algorithm is smaller, indicating that it obtains better feasible solutions. Additionally, the stability of the Q-learning algorithm is superior to that of SARSA.
Figure 8 shows the convergence of the TD error for the Q-learning and SARSA algorithms. It can be observed that the Q-learning algorithm reaches convergence after approximately 1250 training steps, whereas SARSA remains unstable even after 2000 steps.
From the above tables, it is evident that the improved Q-learning algorithm can indeed find the optimal solution, and its running time is much shorter than that of CPLEX across the various case scales. In addition, the running time of CPLEX grows with the case scale, while the running time of the Q-learning algorithm changes only slightly as the case size increases, with negligible differences ranging from 0.04 s to 0.61 s. Moreover, according to the results for the first five cases, the maximum running time of the algorithm is only 14% of the running time of CPLEX, and in the second case it is only 1%, showing that the algorithm can find the optimal solution far more efficiently than CPLEX, which is crucial for companies and factories. Importantly, the experimental results show that in the small- and medium-scale cases, CPLEX provides the same optimal solution as the algorithm.
However, when the last, large-scale case with 46 tasks is run, CPLEX encounters a memory overflow issue. It can be seen that CPLEX eliminates many rows and improves the bound many times; however, because the scale of this case is too large and the RAM in the running environment is insufficient, there is not enough memory to continue constructing the required data. The solution space of this case is too large, with too many possible solutions, for CPLEX to find a better solution, whereas the Q-learning algorithm can still provide one. This shows that the improved Q-learning algorithm can indeed solve the DLBP and that the solution it provides is optimal. In the future, the scale of recycled products may grow further and may involve large products that need to be disassembled, such as engines and used motorcycles. In such cases, CPLEX may struggle to find a better solution, but the proposed algorithm can find a feasible solution in a very short time. This demonstrates that it can address not only the recycling of small disassembled products but also the recycling of large used products. The proposed algorithm is not bounded by the size of the case when solving the DLBP and can provide an optimal solution in a shorter time.