1. Introduction
A product manufacturing site involves various processes, machines, robots, and logistics. With the development of smart factory technology, many elements of manufacturing sites have become unmanned. In an unmanned line, a robot takes charge of placing materials into and retrieving them from the machines, and material supply is governed by a scheduling logic that determines when and which materials to send to which machines. Therefore, efficient scheduling is essential for the timely supply of materials to the machines. Because robots are expensive pieces of equipment, they cannot be deployed in large numbers, and one robot should handle as many machines as possible in one line. Material scheduling in such a multi-process line is a hybrid flow-shop scheduling problem because the processes are performed sequentially and several machines operate in parallel at each stage. This study addresses material scheduling for supplying materials to machines in an unmanned multi-process line, which is a hybrid flow-shop problem.
This problem can be defined as follows. First, the raw material to be processed sequentially is placed in a buffer. The raw material is then removed from the buffer, placed into a machine of the first process, and returned to the buffer when the machine completes processing. The material treated in the first process is subsequently placed into a machine of the next required process, and in this way the material is processed sequentially by all processes. The scheduling problem addressed here is to decide into which machine a material should be placed, and when, so that the total time to complete the processing of all materials is minimized.
This study makes the following assumptions. The processes of a line must be performed sequentially, and their order cannot be changed. A machine can process only one material at a time; once a material is loaded into a machine, a new material cannot be loaded until the first material is unloaded into the buffer. Several machines are used in each process, and each machine performs only one process.
Figure 1 shows an example of an unmanned multi-processing line consisting of k processes, C1, C2, …, Ck. We assumed that the machines in the manufacturing process are positioned on both sides of a central rail. One robot transfer unit (RTU) moves along the central rail and takes charge of loading materials into and unloading materials from the machines. This means that the position from which material is supplied changes in real time. As shown in Figure 2, the RTU is equipped with a multi-joint robot arm for picking up materials and with buffers for in-process and already-processed materials, respectively. The buffer sizes may vary depending on the space available in the RTU. The RTU takes a tray of raw materials from the tray buffer and returns the tray of completed materials to the tray buffer after all materials are processed.
The problem covered in this study is the scheduling problem for a multi-process unmanned line. As mentioned above, there can be many machines in one line, which means that the state and action spaces of the environment are large, i.e., the environmental space is high-dimensional.
Material scheduling is a type of hybrid flow-shop scheduling problem. Heuristic approaches based on dispatching rules [1] and various meta-heuristic methods [2,3,4,5,6,7] have been proposed for hybrid flow-shop scheduling. However, these methods are difficult to apply to highly complex problems, often fail to find optimal solutions, and require long computation times.
Reinforcement-learning-based approaches have also been used for flow-shop scheduling. However, studies using these approaches have ignored practical factors that increase the complexity of scheduling problems [8,9] and have been conducted in environments with a small number of machines [9]. Some studies have ignored the transportation times of materials to machines and the travel times of robots [8,10,11,12,13] or have assumed an infinite buffer size [13,14]. In addition, most previous investigations have not addressed the high-dimensional environments arising from the many machines to be scheduled; instead, they have considered scheduling for only a few machines [9,15]. Learning in an environment with numerous machines is problematic because the state and action spaces are large [9]. Thus, the above approaches may be inapplicable to the material scheduling problem of a realistic unmanned multi-process line with many target machines.
In this study, we propose an approach to material scheduling that considers realistic constraints in high-dimensional environmental spaces with many machines. To make this difficult-to-learn problem tractable, machines are grouped and a reward prediction method is developed. In addition, a reinforcement learning method considering material transfer time and buffer capacity is established to solve a real-world scheduling problem with realistic constraints. Accordingly, we show that our approach can effectively schedule material supply in realistic multi-process lines with many machines.
2. Related Work
Shop scheduling problems can be classified according to criteria such as the number of machines and resource constraints [16], into job-shop, flow-shop, and open-shop scheduling problems. This study addresses flow-shop scheduling, in which each job proceeds through the machines in a fixed sequence [17,18]. Hybrid flow-shop scheduling problems are more complex than simple flow-shop scheduling problems. As shown in Figure 3, hybrid flow-shop scheduling is a flow-shop problem with parallel machines, in which several similar or identical machines operate at each stage. This study addresses a multi-process line scheduling problem [10,19,20]. Because several machines can be selected at each stage and the processing time varies by process, the appropriate machine to perform each process must be selected every time.
Although scheduling has various objectives, it generally aims to minimize the total time required to complete all tasks [17,18,21,22]. Typically, shop scheduling combines various tasks, the processes they require, and the machines that perform these processes. Previous studies have shown that the complexity is very high because the number of feasible solutions to shop scheduling problems is typically enormous; furthermore, choosing the optimal one among them is an even more difficult NP-hard problem [21,23,24].
Attempts have been made to solve scheduling problems with many different types of techniques, and approximation methods using heuristic and metaheuristic approaches have been used more frequently than exact methods. However, they require considerable computational time [11].
Heuristic techniques include dispatching-rule methods, such as the first-come, first-served rule (FCFS), shortest processing time rule (SPT), longest processing time rule (LPT), and smallest remaining processing time rule (SRPT) [11,22,25]. Because they are simple and easy to implement, scheduling logics using dispatching rules are commonly used. However, in an environment with many exceptions and high complexity, it is hard to obtain high-performance schedules by applying only a single basic priority rule.
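To make the four rules concrete, the following is an illustrative sketch of how a dispatcher might pick the next job from a queue under each rule. The job fields and function names are assumptions for illustration, not part of any cited work.

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival_order: int        # position in the queue (used by FCFS)
    processing_time: float    # time of the next operation (used by SPT/LPT)
    remaining_time: float     # total remaining processing time (used by SRPT)

def select_next(queue, rule="FCFS"):
    """Pick the next job from a non-empty queue according to a dispatching rule."""
    if rule == "FCFS":   # first come, first served
        return min(queue, key=lambda j: j.arrival_order)
    if rule == "SPT":    # shortest processing time first
        return min(queue, key=lambda j: j.processing_time)
    if rule == "LPT":    # longest processing time first
        return max(queue, key=lambda j: j.processing_time)
    if rule == "SRPT":   # smallest remaining processing time first
        return min(queue, key=lambda j: j.remaining_time)
    raise ValueError(f"unknown rule: {rule}")
```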
Metaheuristic methods, such as Tabu search and genetic algorithms, are independent of specific problems and can be applied to various scenarios [11]. A genetic-algorithm-based approach employing additional local search methods and more than one crossover operator has been proposed [3]. Investigations based on a local neighbourhood parallel search method embedded in the Tabu search approach have also been performed [2]. Another study applied branch-and-bound algorithms to component manufacturing; it reflected realistic factors such as buffer restrictions and achieved a slight performance improvement over conventional heuristic methods [5]. A genetic-algorithm-based study generated arbitrary neural networks capable of estimating task sequences [6], and a further study investigated efficient rule generation using gene expression programming [7]. Generally, however, these methods share two disadvantages: difficulty in finding the optimal solution and a computation time that grows as the complexity of the problem increases. In our study, material locations that change in real time and a large environmental space increase the complexity of the problem.
Reinforcement learning techniques are also being used to solve the above-mentioned flow-shop scheduling problems. One study examined an enhanced ε-greedy method designed to trade off exploration and exploitation [26]. Another used temporal-difference-based algorithms, designating basic priority dispatching rules, such as FCFS, SPT, LPT, and SRPT, as actions and assigning a negative reward if a machine is inoperable [27]. One study constructed a reward function based on machine idle time and approximated the value function with neural networks [15]. A Q-learning-based method for learning the action-value function was proposed to minimize the completion time of all jobs [28]. As examples of applying reinforcement learning to hybrid flow-shop scheduling problems, some studies have used Q-learning-based methods to minimize the total machine time for transport scheduling [14,29]. A scheduling method for dispatching rules using a double-deep Q-network (DDQN) algorithm was developed [13], as well as a DDQN-based method focusing on the number of completions instead of the processing time [30]. Other studies used a sparse makespan-minimization criterion formulated as a combinatorial optimization problem [31], combined neural-network-based value function approximation with optimistic inter-agent coordination schemes [32], proposed a DQN approach to the precast concrete scheduling problem of minimizing total tardiness [33], and investigated probabilistic dispatch policies parameterized by a set of real values [34]. However, as mentioned in Section 1, previous studies have not fully treated environmental elements as real-world constraints and have not addressed scheduling in high-dimensional environments with many machines.
A comparison indirectly confirmed that the performance improvements over heuristic methods reported in the previous metaheuristic [5,6,7] and reinforcement learning [31,32,33,34] studies were smaller than the improvement achieved in our study.
3. Proposed Method
This study aims to reduce the complexity of a real-world hybrid flow-shop scheduling problem and enable scheduling for multiple machines. When the scheduling problem is formulated as a Markov decision process, the state transition probabilities cannot be obtained for the target environment considered in this study. Hence, the problem is solved using a model-free method [35].
In this section, first, to apply reinforcement learning to real problems, a method of effectively representing the state is developed, considering the realistic elements that were excluded in previous studies. Subsequently, a method of reducing the complexity of the problem is established by reducing the environmental space of the state and action. Finally, rewards are defined to alleviate the learning difficulties that arise when scheduling problems are formulated for reinforcement learning.
3.1. State Definition
Many previous scheduling studies considered only machine operation times, such as processing and idle times, because realistic factors such as material transfer time and buffer limitations render the scheduling problem more complex and difficult. However, unless these realistic factors are considered, it cannot be ensured that a realistic scheduling problem has been solved.
First, the buffer state resulting from the limited buffer size should be considered. A buffer is the space where raw materials awaiting processing are placed and where materials are stored until the next process operation is due. Because a buffer is physically limited in size, depending on its material state, removing a processed material from a machine or placing a material into a machine may be impossible. Therefore, during scheduling, the target machine cannot be determined simply based on the completion of processing; the buffer state is also required. The next factor to consider is the RTU, which transfers the material. In this study, the considered line is an unmanned multi-process line in which one RTU is responsible for material input/return across multiple machines. Because the location of the RTU changes in real time, the machine to which the optimal action should be assigned cannot be determined in advance. The limited buffer size and the RTU location that changes in real time correspond to real-world constraints. Considering the above, the state of the system is defined using the data generated from the machines and the RTU, as shown in Figure 4.
When the total number of processes (the number of stages) is k, the state information of a material placed in the RTU buffer can be expressed as a (k + 1)-dimensional vector, which includes the raw materials. In addition, the process information and state of each machine can be expressed as an m-dimensional vector, whose length equals the total number of machines. The current position of the RTU is included in the state to determine the distance over which the current material must be transported to or from the target machine. The RTU location can take m + 1 values: the m machine locations plus the line tray buffer location where materials are replenished.
However, when the representation defined above is used for reinforcement learning with 20 machines, the environment has a 45-dimensional state space and 21 actions, which is a very large space. Training an agent in such a large, high-dimensional environment is difficult, and the learning time is extremely long. Hence, the environment must be redesigned to enable efficient learning. The goal of the redesign is to reduce the dimensionality and size of the state and action spaces while incorporating as many environmental features as possible for effective learning.
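As a back-of-the-envelope check of the sizes quoted above, the following sketch reproduces the 45-dimensional state and 21 actions for m = 20 machines and k = 3 processes, assuming the RTU location is encoded over its m + 1 possible positions (the exact encoding is an assumption here, not stated in the text).

```python
m, k = 20, 3

buffer_state_dim = k + 1        # raw-material buffer + one entry per process stage
machine_state_dim = m           # one entry per machine
rtu_location_dim = m + 1        # m machine positions + the line tray buffer

state_dim = buffer_state_dim + machine_state_dim + rtu_location_dim
action_dim = m + 1              # move to one of m machines or to the tray buffer

print(state_dim, action_dim)    # -> 45 21
```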
It is assumed that the machines in the line are arranged at matching positions on both sides of the rail. Considering this, machines performing the same process are grouped together in this study, as shown in Figure 5. The moving distances from a group's representative point to all machines within that group are the same. Because a transfer destination in reinforcement learning becomes a group rather than an individual machine, the RTU location and machine state information included in the state space can be reduced to the number of groups. If any of the machines in a group is idle, the state of that group is idle; the group is marked as working only if all machines in the group are working.
In this study, each group consists of two machines, because this number was found to be sufficient; however, more machines may be grouped depending on the scenario of each line, as shown in Figure 6.
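The following is a minimal sketch of the grouping rule just described: a group is reported as idle if at least one of its machines is idle, and as working only when every machine is busy. The machine states here are simple strings; a real controller would read them from the line.

```python
def group_state(machine_states):
    """Collapse the states of the machines in one group into a single value."""
    return "idle" if any(s == "idle" for s in machine_states) else "working"

def grouped_states(all_machine_states, group_size=2):
    """Group machines of the same process pairwise (as in this study) and
    return one state per group."""
    groups = [all_machine_states[i:i + group_size]
              for i in range(0, len(all_machine_states), group_size)]
    return [group_state(g) for g in groups]

# Example: four machines of one process -> two groups
print(grouped_states(["working", "idle", "working", "working"]))  # ['idle', 'working']
```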
Subsequently, the existing state information is modified to be more concise. First, the buffer state is represented by the presence or absence of materials in the buffer instead of the number of materials. The layout information of the processes is excluded from the state because it can be learned during training. The machine state information is converted into group state information and reduced to two states: idle (a combination of the empty and completed states) and operating. The buffer state information is a binary one-dimensional vector representing whether each buffer contains material or not. The buffers comprise the raw material buffer and the C1, C2, …, and Ck completion buffers. Regardless of the type of buffer, it is expressed as a one-dimensional binary vector because only the presence or absence of material needs to be checked; this is not complicated in practice, because the total buffer size is at most a few tens of units. Because the machine state information for each process is large, it is expressed with one dimension per group, similar to the existing representation. Finally, the redesigned state information is composed using a partial environment, as shown in Figure 7.
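A hedged sketch of assembling the redesigned ("partial environment") state follows: binary flags for the buffers, a binary idle/working flag per machine group, and a one-hot vector for the RTU's current group position. The exact feature layout of Figure 7 is not reproduced; the function and field names are illustrative assumptions.

```python
import numpy as np

def build_state(buffer_has_material, group_states, rtu_group_index, num_groups):
    # Binary buffer flags: raw-material buffer plus one per process completion buffer
    buffer_flags = np.array([1 if b else 0 for b in buffer_has_material], dtype=np.float32)
    # One idle/working flag per machine group
    group_flags = np.array([1 if s == "idle" else 0 for s in group_states], dtype=np.float32)
    # One-hot RTU position over the group representative points (+1 for the tray buffer)
    rtu_onehot = np.zeros(num_groups + 1, dtype=np.float32)
    rtu_onehot[rtu_group_index] = 1.0
    return np.concatenate([buffer_flags, group_flags, rtu_onehot])

# Example with k = 3 processes (4 buffers) and 10 groups of 2 machines each
state = build_state(
    buffer_has_material=[True, False, True, False],
    group_states=["idle"] * 10,
    rtu_group_index=0,
    num_groups=10,
)
print(state.shape)  # (25,)
```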
3.2. Action Definition
Movement to the machine into which a material is to be placed could be considered an action; however, consistent with the state design above, this study uses the grouped representative points instead of individual machines. Therefore, movement to a group representative point is designated as an action, as shown in Figure 8. When a group point is specified, the RTU moves along the rail accordingly and arrives at the target group representative point. Under this design, delivering materials to the actual machines is outside the scope of reinforcement learning, so materials must be delivered from the group representative point to an actual machine separately. Each group consists of two machines of the same process located at the same distance. If one machine is idle, the RTU moves to that machine (chosen randomly if both are idle); if neither machine is idle, the RTU does not move to any machine in that group.
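The following sketch shows how an action chosen by the agent (a group representative point) could be resolved to a concrete machine under the rule just described: move to an idle machine in the group, pick one at random if both are idle, and do nothing if neither is idle. Names and data structures are illustrative, not the paper's code.

```python
import random

def resolve_group_action(group_machines):
    """group_machines: list of (machine_id, state) pairs for one group."""
    idle = [mid for mid, state in group_machines if state == "idle"]
    if not idle:
        return None                 # no transfer; this choice will be penalized
    return random.choice(idle)      # the single idle machine, or a random pick of two

print(resolve_group_action([("M1", "working"), ("M2", "idle")]))     # M2
print(resolve_group_action([("M1", "working"), ("M2", "working")]))  # None
```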
3.3. Reward Definition
A reward is defined based on the processing time of a machine and the travel time to a destination from the current position of the RTU supplying the material. The objective of this study is to schedule each task such that the total time is minimized. As shown in Figure 9, long processing and material transfer times imply a small reward. Hence, a reward can be expressed as an inverse or negative value with respect to time.
As shown in Figure 10, when the reward is negative, a sum of negative values accumulates as the work progresses. If no action is taken and no separate reward is set for waiting, a zero reward is obtained, which is greater than any negative value. In this case, the agent may learn that waiting without taking any action yields a higher reward. Conversely, using an inverse reward is easier than additionally designing an appropriate reward for waiting. Therefore, in this study, the reward is defined as the inverse of time.
Another important consideration when designing rewards is whether a penalty should be imposed. If a reward is given as the inverse of the event time for all actions, learning is compromised, because rewards for irrelevant actions that do not contribute to achieving the objective accumulate. The primary action in this study is to move the RTU carrying the material buffer to a specific machine for processing. Therefore, an instance in which the RTU is moved to a specific machine but the material cannot be processed is penalized.
Figure 11 shows a case in which materials are to be input into a machine that is currently processing. As mentioned in Section 1, because a machine can process only one material at a time, material processing cannot proceed in this case. Cases related to the buffer state are shown in Figure 12. Figure 12a depicts the case in which the action of moving to a machine for processing has been issued. This requires the presence of material in the previous process completion buffer; however, in the current state, processing cannot proceed because that buffer is empty. It is then necessary to retrieve a processed material from a machine and move it to the relevant process completion buffer. However, as shown in Figure 12b, the processed material cannot be retrieved because that buffer is full. In these buffer states, instructing the RTU to move will not lead to material processing. If the material can be processed, a reward is given because the action was the right choice. Conversely, if all the machines in the group are in operation and the material cannot be allocated, the action was a wrong choice and a penalty is given because processing cannot proceed.
A reward is assigned when a specific action is performed. The RTU movement and machine processing whose durations are used as rewards do not finish immediately; the reward becomes available only after the actual operation is completed, as shown in Figure 13a. However, when processing is time-consuming, establishing a causal relationship between the action and the reward can be difficult because the duration between them increases. Generally, processing has a well-defined time range; therefore, a reward can be predicted immediately once an action is determined, as shown in Figure 13b. Therefore, this study assigns the reward based on the predicted time at the moment the action decision is made.
Finally, the reward is calculated as defined in (1):

r = 1 / (p_k + t_{i,j})        (1)

Here, p_k refers to the time required to perform process k on the target machine, and t_{i,j} refers to the time required to move from machine i, at the current RTU location, to the destination machine j, where i, j ∈ {1, …, m} and m is the total number of machines.
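The sketch below combines Equation (1) with the penalty rule described above: the reward is the inverse of the predicted processing-plus-travel time, assigned when the action is decided, and a penalty is given when processing cannot proceed. The specific penalty value and argument names are placeholders, not values from the paper.

```python
def predicted_reward(processing_time_k, travel_time_ij, feasible, penalty=-1.0):
    """Return the reward for moving the RTU to a target group/machine.

    processing_time_k: nominal processing time of process k on the target machine
    travel_time_ij:    travel time from the current RTU position i to machine j
    feasible:          False if no idle machine or the buffer state blocks processing
    """
    if not feasible:
        return penalty                                   # wrong choice: processing cannot proceed
    return 1.0 / (processing_time_k + travel_time_ij)    # Equation (1)

print(predicted_reward(800.0, 12.5, feasible=True))
print(predicted_reward(800.0, 12.5, feasible=False))
```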
4. Experiments
The action space of the environment is composed of the destinations to which the RTU carrying the material can be moved. Because this is a discrete action space, an algorithm that supports discrete actions must be used. In this study, we tested a deep Q-network (DQN) [36] algorithm, which learns a value network, and a proximal policy optimization (PPO) [37] algorithm, which learns both value and policy networks. The obtained results were analysed and compared.
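As a minimal sketch of this setup, the snippet below trains the two agents using the stable-baselines3 implementations of DQN and PPO, assuming a Gymnasium-style environment class UnmannedLineEnv (a hypothetical name) that exposes the discrete group-move action space and the redesigned state. Hyperparameters shown are placeholders, not the values in Tables 2 and 3.

```python
from stable_baselines3 import DQN, PPO

env = UnmannedLineEnv()  # hypothetical environment wrapping the line simulator

dqn_model = DQN("MlpPolicy", env, learning_rate=1e-4, verbose=1)
dqn_model.learn(total_timesteps=10_000_000)

ppo_model = PPO("MlpPolicy", env, learning_rate=3e-4, verbose=1)
ppo_model.learn(total_timesteps=10_000_000)
```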
4.1. Common Experimental Setup
The server specifications and library versions used in the experiments are listed in Table 1. The hyperparameter sets for the experiments using the DQN and PPO algorithms are reported in Table 2 and Table 3, respectively.
The simulation experiments replicated an actual production line. These experiments assumed a total of 20 machines arranged in a line comprising three processes, C1, C2, and C3. The experiments were conducted in a large environment with a problem size (number of machines × number of jobs) of 20 × 600 or more. To confirm the effectiveness of the approach in different settings, two environments were used. For the first experiment, the numbers of C1, C2, and C3 machines were 8, 8, and 4, respectively, and the processing times of the three processes were set to 800, 800, and 400 s, respectively. The input/return operation time of a material was set to 17 s for all processes. Assuming that each tray holds 10 layers and that 9 additional trays are supplied, the objective was to process a total of 100 layers. The size of the raw material buffer was limited to 6, and the sizes of the C1, C2, and C3 process completion buffers were limited to 4, 4, and 6, respectively. The RTU speed was set to 1.2 m/s. The aim of this experiment was to process 600 raw material units (100 layers × 6 pieces) into 600 finished units through the C1, C2, and C3 processes.
For the second experiment, the numbers of C1, C2, and C3 machines were 8, 6, and 6, respectively, and the processing times of the three processes were set to 800, 600, and 600 s, respectively. The input/return operation time of a material was set to 20 s for all processes. Assuming that each tray holds 10 layers and that 9 additional trays are supplied, the aim was again to process a total of 100 layers, as in the first experiment. The size of the raw material buffer was limited to 8, and the sizes of the C1, C2, and C3 process completion buffers were limited to 6, 6, and 8, respectively. The RTU speed was set to 1.2 m/s. The aim of this experiment was to process 800 raw material units (100 layers × 8 pieces) into 800 finished units through the C1, C2, and C3 processes.
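For clarity, the two environments described above can be summarized as configuration dictionaries. The values are taken directly from the description; the field names are illustrative.

```python
EXPERIMENT_1 = {
    "machines": {"C1": 8, "C2": 8, "C3": 4},
    "processing_time_s": {"C1": 800, "C2": 800, "C3": 400},
    "load_unload_time_s": 17,
    "buffer_size": {"raw": 6, "C1": 4, "C2": 4, "C3": 6},
    "rtu_speed_mps": 1.2,
    "layers": 100,
    "pieces_per_layer": 6,      # 600 raw material units in total
}

EXPERIMENT_2 = {
    "machines": {"C1": 8, "C2": 6, "C3": 6},
    "processing_time_s": {"C1": 800, "C2": 600, "C3": 600},
    "load_unload_time_s": 20,
    "buffer_size": {"raw": 8, "C1": 6, "C2": 6, "C3": 8},
    "rtu_speed_mps": 1.2,
    "layers": 100,
    "pieces_per_layer": 8,      # 800 raw material units in total
}
```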
4.2. Experimental Results
The concepts adopted in this study can be summarized as follows. First, a partial-environment method was established to reduce the size of the state and action spaces relative to the overall environment. Second, an inverse reward was used instead of a negative reward. Third, regarding the timing of the reward, instead of giving the reward at the end of an actual event, the reward was given in advance by predicting the outcome of the event as soon as it starts.
To verify the above, the experiments were conducted in the following order. First, an experiment using the entire environmental space was conducted, with inverse rewards given at the time of completion of an action; both the DQN and PPO algorithms were compared. Based on the results, the PPO method yielded much better indicators than the DQN method, so the former was used in all subsequent experiments. To verify the first concept, the performance achieved using a partial environmental space was compared with that achieved using the entire environmental space, again with inverse rewards given at the time of completion of the actions. To verify the second concept, the effectiveness of an inverse reward was compared with that of a negative reward: as in the previous experiment, a partial environmental space was used and rewards were given after the end of an event, and an additional experiment was conducted using a negative reward instead of an inverse reward. Finally, to verify the third concept, a partial environmental space and an inverse reward were used together with the method of predicting and giving rewards at the start of an event instead of at its end.
Figure 14 shows the learning results based on the first experimental environment.
Subsequently, to verify the scheduling performance of each method, the total processing time was measured and compared. Three existing heuristic algorithms (proximity first, first-in-first-out (FIFO), and process first) and the reinforcement learning models (entire and partial environments, inverse and negative rewards, and both reward-timing methods) were used in the experiments. The total processing completion time of each reinforcement learning model was measured and used as the indicator. The proximity-first scheduling method delivers materials to the idle machine closest to the RTU in real time. The FIFO method delivers materials to each process machine in the order of the materials currently contained in the buffer. Finally, the process-first scheduling method focuses on moving a material to the next process.
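The sketch below illustrates the proximity-first baseline just described: among the machines that are currently idle and can accept a material, the one closest to the RTU's current position is chosen. The machine records and field names are illustrative placeholders.

```python
def proximity_first(rtu_position, machines):
    """machines: list of dicts with 'id', 'position', 'idle', 'can_accept'."""
    candidates = [m for m in machines if m["idle"] and m["can_accept"]]
    if not candidates:
        return None   # nothing can be supplied right now
    return min(candidates, key=lambda m: abs(m["position"] - rtu_position))["id"]

machines = [
    {"id": "M1", "position": 2.0, "idle": True,  "can_accept": True},
    {"id": "M2", "position": 8.0, "idle": True,  "can_accept": True},
    {"id": "M3", "position": 5.0, "idle": False, "can_accept": True},
]
print(proximity_first(rtu_position=6.0, machines=machines))  # M2
```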
Table 4 and Table 5 show the results of the scheduling performance measurements for each method in the two experimental environments.
5. Discussion
As learning progressed, the rewards in Figure 14a and the cumulative rewards (scores) in Figure 14b tended to increase for most methods, and the total processing completion time in Figure 14c gradually decreased. However, when the learning steps exceeded a certain value, the total processing completion time increased, causing performance degradation. In particular, Figure 14c shows the following results. The PPO algorithm outperformed the DQN algorithm in terms of the total processing completion time. The suggested use of a subspace outperformed the use of the entire space. With a negative reward, scheduling was impossible, as the total processing completion time continuously increased. The scheduling performance and learning effect were higher when predicted rewards were used instead of rewards given at actual completion.
The experimental results for the performance comparison are shown in Table 4 and Table 5 in Section 4.
Table 4 summarizes the results of the performance comparison experiments in the first environment. Among the three existing heuristic methods, the proximity-first scheduling method exhibited the best performance. The reinforcement learning algorithm performed very poorly in the early stages of learning, but its performance gradually improved as learning progressed. In agreement with the learning results above, the performance degraded with excessive learning. When the PPO algorithm was applied to the entire environment, the peak performance was achieved after 40 million time steps, and the performance did not improve further thereafter. The partial-environment experiment showed improved completion times overall, with the best performance of 25 h, 36 min, 11 s achieved at 30 million time steps. Comparing the reward forms, the negative reward could not minimize the total processing completion time because scheduling was impossible; the inverse reward method established in this study was therefore confirmed to achieve better scheduling performance. Finally, when the reward prediction method, i.e., the final method developed in this study, was applied, the best completion time was 24 h, 13 min, 38 s, approximately 59.90% better than that of the existing proximity-priority method. This result was also achieved with a considerably shorter learning time than those of the other methods.
Table 5 lists the results of the performance comparison experiments in the second environment. In this environment, the FIFO method showed the best performance among the heuristics. The second experiment also showed the best performance when the subspace and the inverse/prediction reward method developed in this study were used: a completion time of 28 h, 5 min, 46 s at 20 million time steps, an 83.90% improvement over the 51 h, 40 min, 10 s achieved by FIFO, the best of the traditional heuristics. We also compared this result with the improvements over heuristic methods reported in existing studies; our experiments showed a larger improvement. The performance improvements over heuristic methods in existing reinforcement learning studies were 14.1–15.9% [32], 2.6–26.8% [33], and 12.6–22.3% [34]. Therefore, we conclude that the method established in this study is effective in solving real-world scheduling problems.
Our work did not address exceptional situations such as machine failures or delays during work. For future research, the following can be considered. If the processing time is prolonged by a delay, follow-up measures are required to compensate for the reward already given. When there is a large difference between the reward calculated in advance and the actual processing time, the actual reward could be recalculated by excluding the additional time. A broader treatment of such exceptional situations will be the focus of future investigations.
6. Conclusions
In this study, a manufacturing environment was redesigned considering realistic factors to solve a hybrid flow-shop scheduling problem using deep reinforcement learning. Several previous studies have not considered practical constraints. The location of the RTU responsible for transferring materials changes in real time and affects scheduling; hence, in this study, the environment was designed considering the RTU position. Moreover, the problem of a high-dimensional space originating from the use of many machines in unmanned lines was solved by grouping the machines. A reward prediction method was developed to minimize the gap between the action and reward times and thereby increase the learning effect.
After reinforcement learning in the designed environment, total work completion time measurements were performed, and the performance was compared with that of existing heuristic scheduling methods. Performance was improved by 59.90% and 83.90% in the two experiments, confirming the superiority of the proposed approach over the existing heuristic scheduling methods. In addition, comparative experiments confirmed that the developed techniques, such as reducing the environmental space and predicting rewards, are considerably effective. These results demonstrate that deep reinforcement learning can be applied to high-dimensional real-world environments to solve highly complex scheduling problems. Future studies will investigate scheduling across an entire manufacturing range spanning multiple lines connected by logistics robots that supply materials between the lines.