Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem

Gil, Chang-Bae; Lee, Jee-Hyong

doi:10.3390/app12189332

Open AccessArticle

Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem

by

Chang-Bae Gil

¹

and

Jee-Hyong Lee

^2,*

¹

Department of Digital Media and Communications Engineering, Sungkyunkwan University, Suwon 16419, Korea

²

Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(18), 9332; https://doi.org/10.3390/app12189332

Submission received: 8 August 2022 / Revised: 10 September 2022 / Accepted: 14 September 2022 / Published: 17 September 2022

(This article belongs to the Section Applied Industrial Technologies)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Manufacturing sites encounter various scheduling problems, which must be dealt with to efficiently manufacture products and reduce costs. With the development of smart factory technology, many elements at manufacturing sites have become unmanned and more complex. Moreover, owing to the mixing of several processes in one production line, the need for efficient scheduling of materials has emerged. The aim of this study is to solve the material scheduling problem of many machines in a hybrid flow-shop environment using deep reinforcement learning. Most previous work has ignored some conditions, which were critical for solving practical problems. Such critical conditions make the scheduling more complex and difficult to solve. They expand the size of the state and large action space and make learning in an environment with many machines problematic. In this study, a reinforcement learning approach was developed considering practical factors such as the processing time and material transfer to solve realistic manufacturing scheduling problems. Additionally, a method to simplify the high-dimensional environmental space at manufacturing sites for efficient learning was established to solve the problem of learning in a high-dimensional space. Through experiments, we showed that our approach could optimally schedule material scheduling in multi-process lines, which contributes to realistic manufacturing intelligence.

Keywords:

deep reinforcement learning; hybrid flow-shop problem; manufacturing intelligence; scheduling; smart factory

1. Introduction

A product manufacturing site involves various processes, machines, robots, and logistics. With the development of smart factory technology, many elements at manufacturing sites have become unmanned. In an unmanned line, a robot takes charge of materials input/return into/from the machines, and material supply is ensured using a scheduling logic that determines when and which materials to send to the machines. Therefore, efficient scheduling is essential for the timely supply of materials to the machines. Because robots are expensive pieces of equipment, they cannot be deployed in large numbers, and one robot should be able to handle as many machines as possible in one line. Material scheduling in such a multiprocessing line is a hybrid flow-shop scheduling problem because the processes are sequentially performed and several machines process in each stage. This study is aimed at dealing with material scheduling for supplying materials to machines in an unmanned multi-processing line, which is a hybrid flow-shop problem.

This problem can be defined as follows. First, the raw material to be processed sequentially is placed in a buffer. Subsequently, the raw material is removed from the buffer, placed into the machine of the first process, and returned to the buffer when the machine completes processing. Following this, the first-treated material is placed into the machine of the next required process. Accordingly, the material is sequentially processed by all processes. The scheduling problem to be addressed here refers to making the decision of into which machine to place a material in and when, so that the total processing time to complete the processing of all materials is optimized.

This study makes the following assumptions. The processes of a line must be sequential, and the order cannot be changed. Only one material can be processed at a time by a machine. If a material is loaded into a machine, a new material cannot be loaded to the machine until the first material is unloaded into the buffer. Several machines are used in each process, and one machine deals with one process. Figure 1 shows an example of an unmanned multi-processing line consisting of k processes, C1, C2, …, Ck. We assumed that machines in the manufacturing process are positioned on both sides of the central rail. One robot transfer unit (RTU) moves along the central rail and takes charge of loading material into and unloading material from the machines. This means that the position of supplying the material changes in real time. As shown in Figure 2, the RTU is equipped with a multi-joint robot arm to pick up materials and buffers for materials in processing and already processed, respectively. The buffer sizes may vary considering the space in the RTU. The RTU takes a tray of raw materials from the tray buffer and returns the tray of completed materials to the tray buffer after all materials are processed.

The problem covered in this study is the scheduling problem for a multi-process unmanned line. As mentioned above, there can be many machines in one line, which suggests that the size of the state and action of the environment is high, i.e., the environmental space is high-dimensional.

Material scheduling is a type of hybrid flow-shop scheduling problem. Heuristic methods have been proposed for hybrid flow-shop scheduling using dispatching rules [1] and various meta-heuristic methods [2,3,4,5,6,7]. However, these methods have the disadvantages of being difficult in their application to high-complexity problems, difficult in finding optimal solutions, and having a long calculation time.

Reinforcement-learning-based approaches are also used for flow-shop scheduling. However, studies that have used these approaches have ignored practical factors, including factors that increase the complexity of scheduling problems [8,9], and studied in environments with a small number of machines [9]. Some studies have ignored the transportation times of materials to machines and the moving times of robots [8,10,11,12,13] or have assumed an infinite buffer size [13,14]. In addition, most previous investigations have not addressed the problem of high-dimensional environments arising from many machines to be scheduled; instead, they have considered only scheduling for a few machines [9,15]. Learning in an environment with numerous machines is problematic because the size of the state and the action space is large [9]. Thus, the above approaches may be inapplicable to the material scheduling problem of a realistic unmanned line multi-process manufacturing unit with many target machines.

In this study, we propose an approach to material scheduling considering realistic constraints in high-dimensional environmental spaces with many machines. For this problem, which is difficult to learn, machines were grouped, and a prediction method for reward was developed. In addition, a reinforcement learning method considering material transfer time and buffer capacity was established to solve a real-world scheduling problem with realistic constraints. Accordingly, we showed that our approach can optimally schedule realistic material scheduling in multi-process lines with many machines.

2. Related Work

Shop scheduling problems can be classified according to classification criteria such as numbers of machines and resource constraints [16]. They can be classified into job-shop, flow-shop, and open-shop scheduling problems. This study addresses flow-shop scheduling problems because each job proceeds in a certain flow sequence by machines [17,18]. Hybrid flow-shop scheduling problems are more complex than simple flow-shop scheduling problems. As shown in Figure 3, hybrid flow-shop scheduling is a flow-shop problem with parallel machines, in which several similar or identical machines process in each stage. This study addresses a multi-process line scheduling problem [10,19,20]. Several machines can be selected at a stage, and the processing time varies by the process; therefore, selecting the appropriate machine to perform the process every time is necessary.

Although scheduling has various objectives, it generally minimizes the total time required to complete all tasks [17,18,21,22]. Typically, shop scheduling is a problem that combines various tasks, necessary processes, and various machines to perform these processes. Previous studies have shown that the problem complexity is very high because the obtained solutions of shop scheduling problems are typically numerous; furthermore, choosing which of these is optimal is a more difficult NP-hard problem [21,23,24].

Attempts have been made to solve scheduling problems with many different types of techniques, and approximation methods using heuristic and metaheuristic approaches have been used more frequently than accurate methods. However, they require considerable computational time [11].

Heuristic techniques include dispatching rule methods, such as the first-come-to-a-queue, first-service rule (FCFS), shortest processing time rule (SPT), longest processing time rule (LPT), and smallest remaining processing time rule (SRPT) [11,22,25]. Because of the advantage of simple and easy implementation, scheduling logic using dispatching rules is commonly used. However, in an environment with many exceptions and high complexity, it is hard to obtain high-performance schedules by applying only one basic priority rule.

Metaheuristic methods, such as Tabu search and genetic algorithm techniques, are independent of specific problems, and can be applied to various scenarios [11]. A genetic-algorithm-based approach that employed additional local search methods using more than one crossover operator was proposed [3]. Investigations based on the local neighbourhood parallel search method embedded in the Tabu search approach have also been performed [2]. There has also been a study using branch and bound algorithms in the manufacturing of components. This was a study that reflected realistic factors, such as considering buffer restrictions, and did achieve a slight performance improvement over conventional heuristic methods [5]. There has been a genetic-algorithm-based study that generated arbitrary neural networks capable of estimating task sequences [6]. A study has also been conducted to study efficient rule generation using gene expression programming [7]. However, generally, these methods have disadvantages: the difficulty in finding the optimal solution and the computation time increase required to obtain the optimal solution as the complexity of the problem increases. In our study, the location of materials that change in real time and a large environmental space increase the complexity of the problem.

Reinforcement learning techniques are being used to solve the above-mentioned flow-shop scheduling problem. There has been a paper that studied the enhanced e-greedy method designed as a trade-off of exploration and exploitation [26]. A study has been conducted using algorithms based on time differences by designating basic priority dispatching rules, such as FCFS, SPT, LPT, and SRPT, as actions and assigning a negative reward if a machine is inoperable [27]. One study used a method of constructing a reward function based on the idle time of a machine and approximating the value function with neural networks [15]. A Q-learning-based method for action-value-function learning was proposed to minimize the completion time of all work hours [28]. As an example of the application of reinforcement learning to hybrid flow-shop scheduling problems, some studies have used a Q-learning-based method to minimize the total machine time for transport scheduling [14,29]. A scheduling method for dispatching rules using a double-deep Q-network (DDQN) algorithm was developed [13]. In addition, a method using a DDQN algorithm focusing on the number of completions instead of the processing time was established [30]. There was a study using a sparse make-span minimization criterion with a combination optimization problem to solve the scheduling problem [31]. There was a study that combined a neural-network-based value function approximation and the use of optimistic inter-agent coordination schemes [32]. There have been studies that proposed a DQN approach to solve the precast concrete scheduling problem of minimizing total tardiness [33]. There have also been studies on probabilistic dispatch policies using a set of real-valued parameters [34]. However, as mentioned in Section 1, previous studies have treated environmental elements as real-world constraints, and have not addressed scheduling in high-dimensional environments with many machines.

Through comparison, it was indirectly confirmed that the degree of improved performance compared to the heuristic methods in the previous metaheuristic [5,6,7] and reinforcement learning [31,32,33,34] studies was lower than the degree of improvement in our study.

3. Proposed Method

This study aims to reduce the complexity of a real-world hybrid flow-shop scheduling problem and enable scheduling for multiple machines. When the scheduling problem is applied to the Markov decision process, the transition probability of the state change cannot be obtained in the target environment considered in this study. Hence, the problem was solved using a model-free method [35].

In this section, first, to apply reinforcement learning to real problems, a method of effectively representing the state of reinforcement learning is developed considering the realistic elements which were excluded in previous studies. Subsequently, a method of reducing the complexity of the problem is established by reducing the environmental space of the state and action of reinforcement learning. Finally, rewards that can alleviate the learning difficulties in applying scheduling problems to reinforcement learning are defined.

3.1. State Definition

Many previous scheduling studies considered only the machine operation time, such as the processing/idle time. This is because realistic factors such as material transfer time and buffer limitations render the scheduling problem more complex and difficult. However, without considering these realistic factors, a realistic scheduling problem cannot be ensured to have been solved.

First, the buffer state according to the limitation of the buffer size should be considered. A buffer is the space where raw materials to be processed are located and simultaneously stored until the next process operation is due. Because a buffer is physically limited in size, depending on its material state, removing a processed material from a machine or placing a material into a machine may be impossible. Therefore, during scheduling, the target machine cannot be determined simply based on the completion of processing; a buffer state is also required. The next factor to consider is the RTU, which transfers the material. In this study, the considered line is an unmanned multi-process line in which one RTU is responsible for material input/return through multiple machines. Because the location of the RTU changes in real time, determining the machine to which the optimal action is assigned to is practically impossible. The limitation of the buffer size and the location of the RTU that changes in real time correspond to the constraints of the real world. Considering the above, the state of the system is defined using the data generated from the machines and RTU, as shown in Figure 4.

When the total number of processes (the number of stages) is k, the state information of a material placed in the RTU buffer may be expressed as a k + 1-dimensional vector, which includes raw materials. In addition, the process information and state of each machine can be expressed as an m-dimensional vector, which has a magnitude equal to the total number of machines. The current position of the RTU is defined as the state for determining the distance at which the current material can be input/returned to/from the target machine. The RTU location is defined as m + 1 by adding all machine locations m to the line tray buffer location that replenishes materials.

However, when the methods defined above are used for reinforcement learning, when considering 20 machines, the environment has a 45-dimensional state space and 21 action spaces, which are very large spaces. Training an agent in such a large and high-dimensional environment is difficult, and the learning time is extremely long. Hence, redesigning the environment is necessary to enable efficient learning. The goal of redesigning is to reduce the number of dimensions and size of the state/action while incorporating as many environmental features as possible for effective learning.

It is assumed that the machines in the line are arranged on both sides of the rail at the same positions. Considering this, equivalent process machines on the line are grouped together in this study, as shown in Figure 5. The moving distances from the group representative point to all target machines within the same group are the same. As a transfer destination in reinforcement learning becomes a group rather than a machine, the RTU location and machine state information included in the state space can be reduced by the number of groups. If any of the machines in a group are idle, the state of that group is idle. Conversely, the group is marked as in a working state only if all machines in a group are in a working state.

In this study, two machines are built as a group, because this number was found to be sufficient; however, more machines may be grouped according to the scenario of each line, as shown in Figure 6.

Subsequently, the existing state information is concisely modified. First, the buffer state is determined by the presence or absence of materials in the buffer instead of the number of materials. The layout information about the process is excluded from the state information because it can be learned in the learning process. The machine state information is converted into group state information and reduced to express two states: idle (which is a combination of empty and completed states) and operation states. Buffer state information is a binary one-dimensional vector representing whether each buffer has material or not. Buffers include raw material buffers, C1 completion buffers, C2 completion buffers, …, and Ck completion buffers. Regardless of the type of buffer, it is expressed as a one-dimensional binary vector because it only needs to check whether there is material in the buffer or not. In fact, it is not complicated because the total buffer size is tens of units at the most, and it can be expressed as a one-dimensional binary vector. Because the information regarding the machine state for each process is large, it is expressed in

\frac{m}{2}

dimensions, similar to existing methods. Finally, the redesigned state information is composed using a partial environment, as shown in Figure 7.

3.2. Action Definition

The movement to the machine to which a material is to be input can be considered as an action; however, in this study, the point of grouping of the machines, instead of using individual machines, is selected for use in the previous state design. Therefore, the movement to a grouped representative point is designated as an action, as shown in Figure 8. When a group point to move to is specified, the RTU moves on the rail accordingly and arrives at the target group representative point. In this concept, the part that delivers materials to actual machines is not an area of reinforcement learning. Therefore, it is necessary to deliver materials from a group to an actual machine. The current group consists of two machines of the same process located at the same distances. If one machine is idle, the RTU moves to that machine (randomly if both are idle); if neither machine is idle, it does not move to any machine in that group.

3.3. Reward Definition

A reward is defined based on the processing time of a machine and the travel time to a destination using the current position of the RTU supplying the material. The objective of this study is to schedule each task such that the total time is minimized. As shown in Figure 9, long processing and material transfer times imply a small reward. Hence, a reward can be expressed as an inverse or negative value with respect to time.

As shown in Figure 10, a sum of negative numbers is calculated as work progresses when a reward is a negative number. If no action is taken and no separate reward is set for waiting, a zero reward is obtained, which is greater than a negative value. In this case, the agent may possibly learn that it can wait without any action to receive a higher reward. Conversely, using a reverse reward is easier than additionally considering an appropriate reward when waiting for work. Therefore, in this study, a reward is defined in the reverse order of time.

Another important consideration when designing rewards is whether a reward should be penalized. If a reward is given as an inverse of the event time for all actions, learning is compromised, because rewards for irrelevant actions that do not contribute to achieving the objective accumulate. The primary action in this study is to move the RTU with a material buffer to specific machines for processing. Therefore, an instance in which the RTU is moved to a specific machine, but the material cannot be processed, is regarded as a penalty.

Figure 11 shows a case in which materials are to be input into a machine that is currently processing. As mentioned in Section 1, because a machine can only process one material at a time, in this case, material processing cannot proceed. A case related to the buffer state is shown in Figure 12. Figure 12a depicts the case in which the action of moving to a machine for processing has occurred. This requires the presence of the material in the previous process completion buffer; however, in the current state, processing cannot proceed because the buffer is empty. Subsequently, it is necessary to retrieve a processed material from a machine and move it to the relevant process completion buffer. However, as shown in the case in Figure 12b, the processed material cannot be retrieved because the buffer is full. In this buffer state, instructing an action to move the RTU will not lead to material processing. If the material can be processed, it is given a reward because it is the right choice. Conversely, if all the machines in the group are in operation and the material cannot be allocated, this is a wrong choice, and it penalizes them because processing cannot proceed.

A reward is assigned when a specific action is performed. Movement of the RTU and machine processing, the times for which are used as rewards, do not occur immediately, but they occur only after an actual operation is completed, as shown in Figure 13a. However, when processing is time-consuming, establishing a causal relationship between the action and the reward can be difficult because the duration between them increases. Generally, processing has a certain time range; therefore, a reward can be predicted immediately once an action is determined, as shown in Figure 13b. Therefore, this study uses a predetermined time to reward the system when making an action decision.

Finally, a reward calculated as defined in (1).

r = {\begin{matrix} 0 \\ \frac{1}{T i m e_{m_{s r c}, m_{d s t}}} + \frac{1}{T i m e_{k}} \end{matrix} \begin{matrix} \begin{matrix} if m o v i n g w h e n u n a v a i l a b l e \\ m a c h i n e o r b u f f e r \end{matrix} \\ \begin{matrix} else if m o v i n g w h e n a v a i l a b l e \\ m a c h i n e a n d b u f f e r \end{matrix} \end{matrix}

(1)

Here,

T i m e_{k}

refers to the time required to process the k process in the machine, and

T i m e_{m_{s r c}, m_{d s t}}

refers to the time required to move from the origin machine,

m_{s r c}

, of the current RTU location to the destination machine,

m_{d s t}

, in the total number of machines, m.

4. Experiments

The action space of the environment is composed of the destinations to which the RTU carrying the material will be moved. Because this is a discrete action space, a supporting algorithm must be used. In this study, we tested a deep Q-network (DQN) [36] algorithm (which learns the value of the network) and a proximal policy optimization (PPO) [37] algorithm (which learns the value/policy of the network). The obtained results were analysed and compared.

4.1. Common Experimental Setup

The server specifications and library version used in the experiments are listed in Table 1. The hyperparameter sets for experiments using the DQN and PPO algorithms are reported in Table 2 and Table 3, respectively.

The simulation experiments replicated an actual production line. Generally, these experiments were conducted assuming that a total of 20 machines were arranged in a line mixed with three processes, C1, C2, and C3. In this study, basically, the experiment was conducted in a large environment with a problem size (number of machines × number of jobs) of 20 × 600 or more. To confirm their effectiveness in different environments, the experiments were conducted in two environments. For the first experiment, the numbers of C1, C2, and C3 machines were 8, 8, and 4, respectively. The processing times for the three processes were set as 800, 800, and 400 s, respectively. The input/return operation time of the material was set as 17 s for all processes. Assuming the trays to be exhausted are of 10 layers and that 9 additional materials are supplied, the objective was to process the material from a total of 100 layers. The size of the raw material buffer was limited to 6, and the size of the C1, C2, and C3 process completion buffers was limited to 4, 4, and 6, respectively. The RTU speed was set as 1.2 m/s. The aim of this study was to process 600 raw material units (100 layers × 6 pieces) into 600 process units after executing the C1, C2, and C3 processes.

For the second experiment, the numbers of C1, C2, and C3 machines were 8, 6, and 6, respectively. The processing times for the three processes were set as 800, 600, and 600 s, respectively. The input/return operation time of the material was set as 20 s for all processes. Assuming the trays to be exhausted were of 10 layers and that 9 additional materials were supplied, the aim was to process material from a total of 100 layers, similar to the first experiment. The size of the raw material buffer was limited to 8, and the size of the C1, C2, and C3 process completion buffers was limited to 6, 6, and 8, respectively. The RTU speed was set as 1.2 m/s. The aim of this study was to process 800 raw material units (100 layers × 8 pieces) into 800 process units after performing the C1, C2, and C3 processes.

4.2. Experimental Results

The concepts adopted in this study can be summarized as follows. First, a partial environmental method was established to reduce the size of the state and action compared to that of the overall environment. Second, an inverse reward, rather than a negative reward, was used. Third, regarding the timing of a reward, instead of giving a reward at the end of an actual event, a reward was given in advance by predicting the end of an event as soon as it occurs.

To prove the above, the experiments were conducted in the following order. First, an experiment using the entire environmental space was conducted, and a method that involves giving inverse rewards and rewards at the time of completion of an action was used. The algorithm was compared to both the DQN and PPO algorithms. Based on the results, the PPO method had much better indicators than the DQN method; thus, the former was used in all subsequent experiments. To prove the first concept, the improvement in performance achieved using a partial environmental space compared to the performance achieved when the entire environmental space is employed was determined. Moreover, inverse rewards and rewards at the time of completion of the actions were used. To prove the second concept, the effectiveness of an inverse reward compared to that of a negative reward was examined. Similar to the above experiment, a partial environmental space was used, and rewards were given after the end of an event. An experiment was also conducted using a negative reward instead of an inverse reward. Finally, a partial environmental space and an inverse reward were used to prove the concept, and a method of predicting and giving rewards in advance at the start of the event, instead of awarding them at the end of an event, was adopted. Figure 14 shows the learning results based on the first experimental environment.

Subsequently, the total processing time required was measured and compared with that in the next experiment to verify the scheduling performance of each method. Three existing heuristic algorithms (proximity first, first-in-first-out (FIFO), and process first) and the reinforcement learning algorithm were used in the experiments on the entire and partial environments with inverse and negative rewards and method of reward timing. The total processing completion time of the reinforcement learning model was measured and used as an indicator. The proximity-priority scheduling method is a method of delivering materials to idle machines closest to the RTU in real time. The FIFO method delivers materials to each process machine in the order of materials currently contained in the buffer. Finally, the process-first scheduling method focuses on moving a material to the next process. Table 4 and Table 5 show the results of measurements on scheduling performance by method in the two experimental environments.

5. Discussion

As learning progresses, the rewards in Figure 14a and cumulative rewards (score) in Figure 14b tended to increase for most methods, and the total machining completion time in Figure 14c gradually decreased. However, when the learning steps exceeded a certain value, the total machining completion time increased, which caused performance degradation. In particular, Figure 14c shows the following results. The performance of the PPO algorithm in terms of the total machining completion time was better than that of the DQN algorithm. The suggested use of a subspace outperformed the use of the entire space. It showed that scheduling by a negative reward was impossible as the total processing completion time continuously increased. The performance of the scheduling system and the learning effect were higher when predicted rewards were used instead of actual occurrence rewards.

The experimental results for the performance comparison are shown in Table 4 and Table 5 in Section 4. Table 4 summarizes the results of the performance comparison of the experiments in the first environment. Among the three existing heuristic methods, the proximity-first scheduling method exhibited the best performance. The results of the reinforcement learning algorithm revealed that it performed very poorly in the early stages of learning. However, its performance gradually improved as the learning progressed. In agreement to the previous learning results, the performance degraded on excessive learning. When the PPO algorithm was applied to the entire environment, the peak performance was achieved after 40 million time steps. Subsequently, the performance did not improve further. The partial environment experiment showed improved completion times overall. The performance was achieved at 25 h, 36 min, 11 s at 30 million time steps. Comparing the reward forms, a negative reward could not minimize the total processing completion time because scheduling was impossible. Therefore, the inverse reward method established in this study was confirmed to achieve a better scheduling performance than the other methods. Finally, when applying the reward prediction method, i.e., the final method developed in this study, the best performance time was 24 h, 13 min, 38 s. This was approximately 59.90% better than that of the existing proximity-priority method. This result was achieved with a considerably shorter learning time compared to those used in the other methods.

Table 5 lists the results of the performance comparison experiments in the second environment. The FIFO method showed the best performance for heuristics in this environment. The second experiment also showed the best performance when using the subspace and inverse/prediction reward method developed in this study. With 28 h, 5 min, and 46 s performance at 20 million time steps, this was an 83.90% improvement over 51 h, 40 min, and 10 s for FIFO, the best of the traditional heuristics. We confirmed the performance by comparing it with the experimental results of existing studies compared to heuristic methods. The results of our experiment showed a higher improvement than the improvements of existing studies. The performance improvements compared to the heuristic methods in existing reinforcement learning studies were 14.1–15.9% [32], 2.6–26.8% [33], and 12.6–22.3% [34]. Therefore, we conclude that the method established in this study is effective in solving real-world scheduling problems.

Our work did not address exceptional situations such as machine failures or delays during work. For future research, the following contents can be considered. If the processing time is prolonged by a delay during the processing, follow-up measures are required to compensate for the previously paid reward. For a large difference between the reward calculated in advance and the actual processing time, it will be possible to calculate the actual reward by excluding the additional time from the reward. The broader consideration of the exception situations will be the focus of future investigations.

6. Conclusions

In this study, a manufacturing environment was redesigned considering realistic factors to solve a hybrid flow-shop scheduling problem by applying deep reinforcement learning. Several previous studies have not considered practical constraints. The location of the RTU responsible for transferring materials changes in real time and affects scheduling. Hence, in this study, the environment was designed considering the RTU position. Moreover, the problem of a high-dimensional space originating from the use of mass machines in unmanned lines was solved by adopting a method of grouping machines. A reward prediction method was developed to minimize the gap between the action and reward times to increase the learning effect.

After reinforcement learning in the designed environment, total work completion time measurement experiments were performed, and the performance with existing heuristic scheduling methods was compared. Performance was improved by 59.90% and 83.90% in each experiment, and the superior performance of the proposed approach compared to the existing heuristic scheduling methods was confirmed. In addition, by comparative experiments, the developed method, such as reducing environmental space and predicting rewards, was confirmed to be considerably effective. These results demonstrate that deep reinforcement learning can be applied to high-dimensional environments in the real world to solve high-complexity scheduling problems. The scope for future studies includes investigating the scheduling of an entire manufacturing range across multiple lines by connections with logistics robots that supply materials between the lines.

Author Contributions

Conceptualization, C.-B.G.; methodology, C.-B.G.; software, C.-B.G.; validation, C.-B.G.; formal analysis, C.-B.G.; investigation, C.-B.G.; resources, C.-B.G.; data curation, C.-B.G.; writing—original draft preparation, C.-B.G.; writing—review and editing, C.-B.G., J.-H.L.; visualization, C.-B.G.; supervision, J.-H.L.; project administration, C.-B.G.; funding acquisition, J.-H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University) and No. 2020-0-00990, Platform Development and Proof of High Trust & Low Latency Processing for Heterogeneous, Atypical, and Large Scaled Data in 5G-IoT Environment).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Madenoğlu, F. Solving the hybrid flow shop scheduling problem using heuristic algorithms. Bus. Manag. Stud. Int. J. 2019, 7, 14–25. [Google Scholar] [CrossRef]
Bożejko, W.; Pempera, J.; Smutnicki, C. Parallel tabu search algorithm for the hybrid flow shop problem. Comput. Ind. Eng. 2013, 65, 466–474. [Google Scholar] [CrossRef]
Ouled Bedhief, A.; Dridi, N. A Genetic Algorithm for Three-Stage Hybrid Flow Shop Scheduling Problem with Dedicated Machines. J. Eur. Systèmes Autom. 2020, 53, 357–368. [Google Scholar] [CrossRef]
Umam, M.S.; Mustafid, M.; Suryono, S. A hybrid genetic algorithm and tabu search for minimizing makespan in flow shop scheduling problem. J. King Saud Univ. Comput. Inf. Sci. 2021; in press. [Google Scholar] [CrossRef]
Azami, A.; Demirli, K.; Bhuiyan, N. Scheduling in aerospace composite manufacturing systems: A two-stage hybrid flow shop problem. Int. J. Adv. Manuf. Technol. 2018, 95, 3259–3274. [Google Scholar] [CrossRef]
Tobias, R.; Abdulrahman, N.; Fabian, B.; Sebastian, L. Evolving Neural Networks to Solve a Two-Stage Hybrid Flow Shop Scheduling Problem with Family Setup Times. In Proceedings of the 53rd Hawaii International Conference on System Sciences, Maui, HI, USA, 7–10 January 2020. [Google Scholar]
Wu, X.; Cao, Z.; Wu, S. Real-Time Hybrid Flow Shop Scheduling Approach in Smart Manufacturing Environment. Complex Syst. Modeling Simul. 2021, 1, 335–350. [Google Scholar] [CrossRef]
Chaudhry, I.A.; Khan, A.A. A research survey: Review of flexible job shop scheduling techniques. Int. Trans. Oper. Res. 2016, 23, 551–591. [Google Scholar] [CrossRef]
Yang, S.; Xu, Z.; Wang, J. Intelligent Decision-Making of Scheduling for Dynamic Permutation Flowshop via Deep Reinforcement Learning. Sensors 2021, 21, 1019. [Google Scholar] [CrossRef]
Fan, K.; Zhai, Y.; Li, X.; Wang, M. Review and classification of hybrid shop scheduling. Prod. Eng. 2018, 12, 597. [Google Scholar] [CrossRef]
Tyagi, N.; Varshney, R.; Chandramouli, A. Six decades of flowshop scheduling research. Int. J. Sci. Eng. Res. 2013, 4, 854–864. [Google Scholar]
Reyna, Y.C.F.; Cáceres, A.P.; Jiménez, Y.M.; Reyes, Y.T. An Improvement of Reinforcement Learning Approach for Permutation of Flow-Shop Scheduling Problems. Rev. Ibérica Sist. Tecnol. Inf. 2019, E18, 257–270. [Google Scholar]
Yang, S.; Xu, Z. Intelligent Scheduling for Permutation Flow Shop with Dynamic Job Arrival via Deep Reinforcement Learning. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; pp. 2672–2677. [Google Scholar]
Han, W.; Guo, F.; Su, X. A reinforcement learning method for a hybrid flow-shop scheduling problem. Algorithms 2019, 12, 222. [Google Scholar] [CrossRef] [Green Version]
Ren, J.; Ye, C.; Yang, F. Solving flow-shop scheduling problem with a reinforcement learning algorithm that generalizes the value function with neural network. Alex. Eng. J. 2021, 60, 2787–2800. [Google Scholar] [CrossRef]
Zhang, J.; Ding, G.; Zou, Y.; Qin, S.; Fu, J. Review of job shop scheduling research and its new perspectives under Industry 4.0. J. Intell. Manuf. 2019, 30, 1809–1830. [Google Scholar] [CrossRef]
Jones, A.; Rabelo, L.C.; Sharawi, A.T. Survey of Job Shop Scheduling Techniques; NISTIR, National Institute of Standards and Technology: Gaithersburg, MD, USA, 1998. [Google Scholar]
Parveen, S.; Ullah, H. Review on job-shop and flow-shop scheduling using. J. Mech. Eng. 2010, 41, 130–146. [Google Scholar] [CrossRef]
Linn, R.; Zhang, W. Hybrid flow shop scheduling: A survey. Comput. Ind. Eng. 1999, 37, 57–61. [Google Scholar] [CrossRef]
Ruiz, R.; Rodríguez, J.A.V. The hybrid flow shop scheduling problem. Eur. J. Oper. Res. 2010, 205, 1–18. [Google Scholar] [CrossRef]
Lourenco, H. A Computational Study of the Job-Shop and the Flow-Shop Scheduling Problems; Cornell University: Ithaca, NY, USA, 1993. [Google Scholar]
Agha, M.E.; El Baradie, M. Flow-Shop Scheduling: Effect of Various Priority Rules on Minimizing Multiple Criteria. In Proceedings of the Thirtieth International MATADOR Conference, Manchester, UK, 31st March–1st April 1993; Kochhar, A.K., Ed.; Macmillan Education: London, UK, 1993; pp. 733–739. [Google Scholar]
Brucker, P. Scheduling Algorithms, 5th ed.; Springer: Berlin/Heidelberg, Germany,, 2007. [Google Scholar]
Brucker, P.; Lenstra, J.K.; Kan, A.H.G.R. Complexity of Machine Scheduling Problems; Stichting Mathematisch Centrum: Amsterdam, The Netherlands, 1975. [Google Scholar]
Panwalkar, S.S.; Wafik, I. A Survey of Scheduling Rules. Oper. Res. 1977, 25, 45–61. [Google Scholar] [CrossRef]
Guo, F.; Li, Y.; Liu, A.; Liu, Z. A Reinforcement Learning Method to Scheduling Problem of Steel Production Process. J. Phys. Conf. Ser. 2020, 1486, 072035. [Google Scholar] [CrossRef]
Zhang, Z.; Wang, W.; Zhong, S.; Hu, K. Flow shop scheduling with reinforcement learning. Asia-Pac. J. Oper. Res. 2013, 30, 1350014. [Google Scholar] [CrossRef]
Fonseca-Reyna, Y.C.; Martínez-Jiménez, Y.; Nowé, A. Q-learning algorithm performance for m-machine, n-jobs flow shop scheduling problems to minimize makespan. Investig. Oper. 2017, 38, 281–290. [Google Scholar]
Pipple, R.R. The Application of Deep Reinforcement Learning to the Hybrid Flow-Shop Scheduling Problem. Ph.D. Thesis, Tilburg University, Tilburg, The Netherlands, 2019. [Google Scholar]
Lamprecht, R.; Wurst, F.; Huber, M.F. Reinforcement Learning based Condition-oriented Maintenance Scheduling for Flow Line Systems. In Proceedings of the 2021 IEEE 19th International Conference on Industrial Informatics (INDIN), Palma de Mallorca, Spain, 21–23 July 2021; pp. 1–7. [Google Scholar]
Tassel, P.; Gebser, M.; Schekotihin, K. A Reinforcement Learning Environment for Job-Shop Scheduling. arXiv 2021, arXiv:2104.03760. [Google Scholar]
Gabel, T.; Riedmiller, M. Adaptive Reactive Job-Shop Scheduling with Reinforcement Learning Agents. Int. J. Inf. Technol. Intell. Comput. 2008, 24, 14–18. [Google Scholar]
Kim, T.; Kim, Y.-W.; Lee, D.; Kim, M. Reinforcement learning approach to scheduling of precast concrete production. J. Clean. Prod. 2022, 336, 130419. [Google Scholar] [CrossRef]
Gabel, T.; Riedmiller, M. Distributed policy search reinforcement learning for job-shop scheduling tasks. Int. J. Prod. Res. 2012, 50, 41–61. [Google Scholar] [CrossRef]
Van Otterlo, M.; Wiering, M. Reinforcement learning and markov decision processes. In Reinforcement Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 3–42. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]

Figure 1. Example configuration of multi-processing line. It consists of machines, an RTU moving on a rail, buffers, and tray buffers to which the material is supplied. The number of machines for each process is not fixed but can vary depending on the line scenario.

Figure 2. RTU structure and movement flow of material trays to buffers.

Figure 3. Flowchart of hybrid flow-shop scheduling.

Figure 4. Defining the reinforcement learning state. If the total number of processes is 3, the total number of buffers is 4 including buffers for raw materials. There are three machine states: no material, working, and work complete.

Figure 5. Reduced environment achieved by grouping machines. Distances from the group point to all target machines are the same because a group point location is designated. All state and action criteria in the developed method are changed from machine to group.

Figure 6. Grouping for multiple installations. (a) Example of grouping of three machines; (b) example of grouping of four machines. Distances from all machines to the group point are the same.

Figure 7. Defining final state of reinforcement learning.

Figure 8. Reinforcement learning action.

Figure 9. Relationship between event time and reward.

Figure 10. Cumulative reward difference according to the type of reward.

Figure 11. Reward penalty according to machine state.

Figure 12. Reward penalty for machine material input/recovery: (a) Penalty for inputting machine materials; (b) penalty for recovering machine materials.

Figure 13. Determining system reward instances: (a) reward is given after event completion; (b) reward is predicted before action occurs.

Figure 14. Results of learning by each method. (a) Trend of rewards per learning time step; (b) trend of cumulative rewards per learning time step; (c) trend of total machining completion time per learning time step.

Table 1. Experimental environment.

Resource	Version
OS	Ubuntu 18.04.3
CPU	Intel^® Xeon^® Silver 4210R CPU @ 2.40GHz
GPU	NVIDIA GeForce RTX 3090
PyTorch	1.8.1
CUDA	10.2
CuDNN	8.11

Table 2. Hyperparameter set in experiments with DQN.

Hyperparameter	Value
learning rate	${1 \times 10}^{- 4}$
discount factor	$9 {. 9 \times 10}^{- 1}$
buffer size	${1 \times 10}^{6}$
batch size	32
learning time step	${1 \times 10}^{8}$
epsilon greedy probability	1.0→ ${5 \times 10}^{- 2}$
target network update cycle	${1 \times 10}^{4}$
Q value network layer/number of nodes	2/64

Table 3. Hyperparameter set in experiments with PPO.

Hyperparameter	Value
learning rate	$3 \times 10^{- 4}$
discount factor	$9 {. 9 \times 10}^{- 1}$
buffer size	64
learning time step	$5 \times 10^{7}$
clip range	${1 \times 10}^{- 1}$
target network update cycle	${1 \times 10}^{4}$
policy network layer/number of nodes	2/64
value network layer/number of nodes	2/64

Table 4. Job completion time by scheduling algorithm in the first experiment.

Algorithm	Space	Reward		Learning Time Step
Algorithm	Space	Reward		1M	5M	10M	15M	20M	25M	30M	35M	40M	45M	50M
DQN	full	inverse	occurrence	167 h 8 min 39 s	136 h 56 min 41 s	85 h 34 min 57 s	82 h 27 min 18 s	71 h 41 min 5 s	93 h 14 min 7 s	121 h 4 min 44 s	99 h 11 min 23 s	81 h 9 min 37 s	62 h 40 min 46 s	50 h 28 min 7 s
PPO	full	inverse	occurrence	121 h 24 min 28 s	64 h 11 min 23 s	59 h 1 min 35 s	55 h 6 min 12 s	51 h 45 min 26 s	50 h 22 min 10 s	47 h 47 min 3 s	42 h 11 min 23 s	38 h 48 min 18 s	38 h 54 min 15 s	38 h 57 min 36 s
PPO	sub	inverse	occurrence	33 h 26 min 2 s	25 h 39 min 25 s	33 h 55 min 40 s	25 h 49 min 16 s	25 h 51 min 2 s	34 h 48 min 59 s	25 h 36 min 11 s	32 h 29 min 17 s	26 h 2 min 53 s	30 h 28 min 58 s	26 h 47 min 51 s
	sub	negative	occurrence	-	-	-	-	-	-	-	-	-	-	-
	sub	inverse	prediction	24 h 25 min 3 s	24 h 39 min 31 s	24 h 29 min 36 s	24 h 35 min 54 s	24 h 26 min 45 s	24 h 13 min 38 s	24 h 18 min 1 s	24 h 15 min 0 s	24 h 33 min 59 s	24 h 17 min 3 s	24 h 14 min 38 s
heuristic	proximity first			38 h 44 min 22 s
	FIFO			39 h 49 min 29 s
	process first			48 h 36 min 57 s

The negative reward method, which cannot be scheduled during learning and does not produce actual performance results, is not the method proposed in this paper. It is a result to show that the inverse reward method we proposed was better compared to the negative method. The bolds are the best performance result in reinforcement learning methods and heuristic methods, respectively.

Table 5. Job completion time by scheduling algorithm in the second experiment.

Algorithm	Space	Reward		Learning Time Step
Algorithm	Space	Reward		1M	5M	10M	15M	20M	25M	30M	35M	40M	45M	50M
DQN	full	inverse	occurrence	-	-	160 h 17 min 47 s	141 h 32 min 5 s	104 h 27 min 36 s	118 h 46 min 52 s	76 h 35 min 41 s	74 h 8 min 13 s	72 h 39 min 17 s	83 h 21 min 28 s	100 h 35 min 33 s
PPO	full	inverse	occurrence	-	93 h 49 min 21 s	84 h 13 min 40 s	80 h 22 min 8 s	75 h 13 min 15 s	67 h 28 min 28 s	57 h 25 min 45 s	59 h 42 min 32 s	61 h 30 min 26 s	61 h 58 min 7 s	63 h 49 min 43 s
PPO	sub	inverse	occurrence	43 h 10 min 20 s	35 h 39 min 23 s	31 h 41 min 49 s	35 h 41 min 59 s	31 h 17 min 27 s	28 h 48 min 50 s	31 h 14 min 33 s	30 h 46 min 4 s	31 h 17 min 34 s	31 h 45 min 48 s	32 h 15 min 16 s
	sub	negative	occurrence	-	-	-	-	-	-	-	-	-	-	-
	sub	inverse	prediction	29 h 12 min 13 s	28 h 59 min 2 s	28 h 52 min 32 s	28 h 35 min 18 s	28 h 5 min 46 s	28 h 28 min 43 s	31 h 38 min 18 s	28 h 54 min 16 s	28 h 9 min 11 s	28 h 42 min 25 s	29 h 2 min 50 s
heuristic	proximity first			56 h 53 min 7 s
	FIFO			51 h 40 min 10 s
	process first			65 h 27 min 26 s

The negative reward method, which cannot be scheduled during learning and does not produce actual performance results, is not the method proposed in this paper. It is a result to show that the inverse reward method we proposed was better compared to the negative method. The bolds are the best performance result in reinforcement learning methods and heuristic methods, respectively.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gil, C.-B.; Lee, J.-H. Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem. Appl. Sci. 2022, 12, 9332. https://doi.org/10.3390/app12189332

AMA Style

Gil C-B, Lee J-H. Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem. Applied Sciences. 2022; 12(18):9332. https://doi.org/10.3390/app12189332

Chicago/Turabian Style

Gil, Chang-Bae, and Jee-Hyong Lee. 2022. "Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem" Applied Sciences 12, no. 18: 9332. https://doi.org/10.3390/app12189332

APA Style

Gil, C. -B., & Lee, J. -H. (2022). Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem. Applied Sciences, 12(18), 9332. https://doi.org/10.3390/app12189332

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Reinforcement Learning Approach for Material Scheduling Considering High-Dimensional Environment of Hybrid Flow-Shop Problem

Abstract

1. Introduction

2. Related Work

3. Proposed Method

3.1. State Definition

3.2. Action Definition

3.3. Reward Definition

4. Experiments

4.1. Common Experimental Setup

4.2. Experimental Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI