Article

Reinforcement Learning-Based Multi-Objective of Two-Stage Blocking Hybrid Flow Shop Scheduling Problem

1 School of Science, Shenyang Ligong University, Shenyang 110159, China
2 Liaoning Key Laboratory of Intelligent Optimization and Control for Ordnance Industry, Shenyang 110159, China
3 School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
* Author to whom correspondence should be addressed.
Processes 2024, 12(1), 51; https://doi.org/10.3390/pr12010051
Submission received: 19 October 2023 / Revised: 15 December 2023 / Accepted: 22 December 2023 / Published: 25 December 2023

Abstract

Consideration of upstream congestion caused by busy downstream machinery, as well as transportation time between different production stages, is critical for improving production efficiency and reducing energy consumption in process industries. A two-stage hybrid flow shop scheduling problem is studied with the objectives of minimizing the makespan and the total energy consumption, taking blocking and transportation restrictions into consideration. An adaptive objective selection-based Q-learning algorithm is designed to solve the problem. Nine state characteristics are extracted from real-time information about jobs, machines, and waiting processing queues. As scheduling actions, eight heuristic rules are used, including SPT, FCFS, Johnson, and others. To address the multi-objective optimization problem, an adaptive objective selection strategy based on t-tests is designed for making action decisions. This strategy can determine the optimization objective based on the confidence of the objective function under the current job and machine state, achieving coordinated optimization of multiple objectives. The experimental results indicate that the proposed algorithm, in comparison to Q-learning and the non-dominated sorting genetic algorithm, achieves average improvements of 4.19% and 22.7% in the makespan and 5.03% and 9.8% in the total energy consumption, respectively. The generated scheduling solutions provide theoretical guidance for production scheduling in process industries such as steel manufacturing. This helps enterprises reduce blocking and transportation energy consumption between upstream and downstream stages.

1. Introduction

The hybrid flow shop scheduling problem (HFSP), which combines the features of traditional flow shop scheduling and parallel machine scheduling, is widely employed in the auto industry, food processing, steel forging [1], and other industries. In the HFSP, the buffer is usually assumed to be infinite; however, owing to production processes and technological restrictions, the buffer is sometimes limited or absent entirely. As a result, when all the machines in the following stage are in the processing state, a job finished in the previous stage remains blocked on its current machine until a machine in the next stage becomes idle [2]. This is referred to as the blocking hybrid flow shop scheduling problem (BHFSP). Blocking increases the waiting time of jobs, resulting in a longer makespan and higher energy consumption, both of which reduce production efficiency. With increased worldwide environmental consciousness and the implementation of China’s carbon peak and carbon neutrality goals, energy consumption is increasingly emphasized as a critical green production metric for enterprises. At the same time, the transportation time of materials between different stages in the process industry cannot be ignored. Consequently, in hybrid flow shop production, drawing up production plans that account for blocking and jointly coordinate production, transportation times, and energy consumption can efficiently utilize resources, reduce production costs, and enhance enterprise competitiveness.
Take the heating–rolling stage in a steel enterprise as an example. When processing is required, a slab is first heated in the heating furnace and then transported by trolley to the rolling stage. Steel rolling includes hot rolling, tube rolling, structural steel rolling, wire rolling, etc. When a slab has finished heating but all the machines in the rolling stage are busy, the semi-finished slab is held in the heating furnace for insulation to prevent material deterioration. Jobs blocked on machines in this way waste energy and delay delivery. Therefore, by jointly considering economic and green indicators, we refine a multi-objective scheduling problem for a hybrid flow shop that accounts for transportation time. Figure 1 depicts the process flowchart for the heating–rolling stage.
To date, most scholars have focused on the blocking flow shop scheduling problem (BFSP). Du et al. [3] investigated a distributed BFSP with an assembly machine and optimized it for total assembly completion time. They proposed an effective discrete monarch butterfly optimization algorithm. Miyata et al. [4] aimed to minimize the total completion time subject to total maintenance costs in BFSP and introduced a mixed-integer linear programming method to solve the problem. Cheng et al. [5] aimed to minimize the total completion time and proposed an effective metaheuristic algorithm to solve BFSP with sequence-dependent setup times. Zhao et al. [6] studied the distributed assembly BFSP with the total tardiness criterion and employed a mixed-integer linear programming approach for problem modeling. They introduced a constructive heuristic algorithm and a water wave optimization algorithm based on problem-specific knowledge. Niu et al. [7] addressed the distributed group BFSP with carryover sequence-dependent setup time constraints. They proposed a two-stage cooperative coevolutionary algorithm aiming to minimize the makespan and total energy consumption. Zhao et al. [8] investigated the distributed BFSP with sequence-dependent setup times, taking into account makespan, total tardiness, and total energy consumption. They introduced a cooperative whale optimization algorithm for solving this problem. Bao et al. [9] focused on the energy-aware sequence-dependent BFSP and constructed a mixed-integer linear programming model to minimize makespan and total energy consumption. They proposed a cooperative iterated greedy algorithm based on Q-learning. Nagano et al. [10] addressed the permutation flow shop problem with blocking and setup times and presented an improved branch-and-bound algorithm with the objective of minimizing total flow time and tardiness. However, traditional flow shop scheduling lacks flexibility, and production lines are often singular. In contrast, the BHFSP allows one or more parallel machines at each operation, providing adaptability to various production tasks. This not only reduces costs for enterprises but also enhances production efficiency.
Many researchers have conducted studies on HFSP with blocking constraints in recent years. Wang et al. [11] proposed a hybrid decode-assisted mutation iterative greedy algorithm for BHFSP with the objective of minimizing the makespan. Qin et al. [12] proposed a mathematical model of BHFSP based on energy-saving criteria and an improved iterative greedy algorithm based on an exchange strategy to minimize total energy consumption. Shao et al. [13] studied the distributed heterogeneous BHFSP, where the objective function is to minimize the makespan, and proposed a learning-based selection hyper-heuristic framework. Missaoui et al. [14] studied BHFSP where the objective function is to minimize the sum of weighted earliness and tardiness and proposed an efficient iterated greedy approach. Aqil et al. [15] studied BHFSP under the constraint of sequence-dependent setup time where the objective function is to minimize the total tardiness and earliness and proposed six algorithms based on the migratory bird optimization and water wave optimization. Qin et al. [16] established a mathematical model of BHFSP, where the objective is to minimize the makespan, and designed an iterative greedy algorithm with a double-level mutation strategy. Zhao et al. [17] proposed a cooperative monarch butterfly optimization algorithm to solve the distributed assembly blocking flow shop scheduling problem, where the optimization objective is to minimize the assembly completion time. Wang et al. [18] investigated the BHFSP on batch processing machines. Their objective was to minimize the total energy consumption of machines and the makespan. They designed a hybrid meta-heuristic algorithm based on ant colony optimization and genetic algorithms to solve this problem. It can be observed that most research on BHFSP primarily focuses on single-objective optimization, where the main optimization objectives are makespan, tardiness, or energy consumption. In light of the increasingly severe environmental challenges, the consideration of coordinated optimization among multiple objectives, such as completion time and energy consumption, is not only crucial for enhancing economic benefits for enterprises but also contributes to achieving sustainable development goals and alleviating environmental burdens.
In the research on multi-objective HFSP, Feng et al. [19] studied HFSP under the parallel sequential movement mode, where the optimization objective is to minimize both the makespan and the handling time, and proposed an improved non-dominated sorting genetic algorithm (NSGA-II) to find Pareto solutions. Lei et al. [20] focused on the objectives of minimizing the makespan and the total tardiness and designed an optimization algorithm based on multi-class teaching to solve the distributed HFSP with sequence-dependent setup times. Geng et al. [21] aimed to minimize the makespan and maximize the average agreement index and designed a hybrid NSGA-II algorithm to solve the fuzzy re-entrant HFSP. Wu et al. [22] studied the re-entrant HFSP with continuous batch processing machines and proposed an improved multi-objective evolutionary algorithm based on decomposition to reduce the production cycle and energy consumption in the production of cold-drawn seamless steel pipes. Wang et al. [23] aimed to minimize the makespan, the total energy consumption, and the machine processing cost and proposed an improved decomposition-based multi-objective evolutionary algorithm to solve the HFSP. Song et al. [24] aimed to minimize both the energy consumption and the makespan and proposed an improved fast NSGA-II to solve the HFSP. Lei et al. [25] solved the distributed two-stage HFSP considering sequence-dependent setup times and proposed an improved shuffled frog leaping algorithm to minimize the number of tardy jobs and the makespan simultaneously. Song et al. [26] aimed to minimize completion time and energy consumption and proposed a hybrid multi-objective teaching–learning-based optimization algorithm based on decomposition to solve the HFSP with unrelated parallel machines. Li et al. [27] investigated energy-efficient HFSP with uniform machines and formulated a new multi-objective mixed-integer nonlinear programming model to minimize total tardiness, total energy cost, and carbon trading cost. They introduced an NSGA-II based on Q-learning and general variable neighborhood search. Wang et al. [28] explored HFSP with dynamic reconfiguration processes and the dual objectives of minimizing the makespan and the whole device’s energy consumption. They obtained a Pareto-based optimal solution set using an improved multi-objective whale optimization algorithm. Cui et al. [29] studied a multi-objective HFSP with unrelated parallel machines, considering minimum makespan and total tardiness, and designed an enhanced multi-population genetic algorithm for solution optimization. In summary, it is essential to consider the impact of transportation time on scheduling results in the context of multi-objective HFSP.
Traditional HFSP solving methods often employ intelligent optimization algorithms and heuristic algorithms. For complex shop scheduling problems that are difficult to solve, reinforcement learning can learn the optimal strategy through interaction with the environment, and its application in the field of scheduling is becoming increasingly widespread. Reinforcement learning has been studied in various settings, including single machines [30], parallel machines [31], flow shops [32,33], job shops [34,35], and flexible job shops [36]. Particularly in the context of reinforcement learning for solving multi-objective problems, Zhang et al. [37] conducted research on the distributed HFSP with a certain degree of symmetry; their objective was to minimize both the makespan and the number of tardy jobs, and they proposed a dual-population genetic algorithm based on Q-learning. Cheng et al. [38] designed a multi-objective Q-learning hyper-heuristic algorithm based on bi-criteria selection, where the objective is to optimize both production efficiency and energy consumption simultaneously. Chang et al. [39] studied the multi-objective dynamic flexible job shop scheduling problem (MODFJSP) and proposed a hierarchical reinforcement learning approach to solve the MODFJSP considering the arrival of random jobs. Li et al. [40] conducted research on the multi-objective flexible job shop scheduling problem with fuzzy processing times, where the optimization objectives are the makespan and the total machine workload, and proposed a reinforcement learning-based multi-objective optimization algorithm. Yuan et al. [41] studied the multi-objective optimization scheduling problem in heterogeneous cloud environments and proposed a multi-objective reinforcement learning job scheduling method with AHP-based weighting. Wu et al. [42] studied the green dynamic multi-objective scheduling problem in a re-entrant hybrid flow shop and proposed an improved Q-learning algorithm. To sum up, when dealing with multi-objective problems, reinforcement learning algorithms typically transform the multiple objectives into a single objective via a weighted sum, with the weights determined by expert experience or experimentation. However, fixed weights are difficult to adapt in real time to changes in the state of the problem, which affects the quality of the solutions.
As mentioned above, previous research in BHFSP has predominantly focused on single-objective optimization, with limited consideration for the coordinated transportation between upstream and downstream. Since the machines are at different geographical locations, transportation times have an impact on scheduling systems. Moreover, optimizing a single objective has inherent limitations when dealing with complex and diverse problems. Therefore, this paper investigates a multi-objective scheduling problem in a two-stage blocking hybrid flow shop with transportation constraints. In multi-objective optimization, determining objective weights often relies on expert experience or experiments. However, fixed weights are challenging to adapt in real time to changes in problem states, affecting the quality of solutions. This paper introduces a Q-learning algorithm based on adaptive objective selection. The algorithm better adapts to dynamic problem changes, enhancing solution flexibility and robustness. The detailed contributions of this paper are as follows:
(1)
For the problem of modern industrial process manufacturing, due to production process requirements, downstream machine congestion can result in upstream blocking, and the transportation time between upstream and downstream cannot be ignored. This paper formulates the HFSP with both transportation and blocking constraints. With the optimization objectives of minimizing the makespan and the total energy consumption, a two-stage BHFSP model incorporating transportation is established.
(2)
We have designed an improved multi-objective Q-learning algorithm to address this model. Additionally, an adaptive objective selection strategy based on t-tests has been developed for handling multi-objective optimization problems. This strategy coordinates the selection of different objectives by evaluating the confidence of the objective functions under the current job and machine state, thus optimizing both the completion time and energy consumption indicators effectively.
The rest of this paper is organized as follows: Section 2 establishes the mathematical model of the two-stage BHFSP with transportation times. Section 3 describes the implementation details of the Q-learning algorithm based on adaptive objective selection. In Section 4, numerical experiments are conducted to demonstrate the effectiveness of the proposed algorithm. Finally, in Section 5, conclusions are drawn, and future research directions are proposed.

2. Problem Formulation

The two-stage BHFSP with transportation times can be described as follows: there are n jobs that must go through two processing stages, indexed by s (s = 1, 2). Each stage has multiple identical parallel machines, and each machine is located at a different geographical location. The processing sequence of stages is the same for all jobs, and each job can be processed on any machine at each stage. Jobs processed in the first stage are transported to the machines of the next stage by transport vehicles. There is no buffer between stages, meaning that once a job completes its processing in the previous stage, it can only leave the machine when the next stage has an available machine; the resulting waiting time of the job is referred to as the blocking time. The objective is to minimize both the makespan and the total energy consumption.
We assume that:
(1)
All jobs have arrived at time zero and can begin processing.
(2)
There is no limit to the number of transport vehicles that can be used after the job leaves the first-stage machine.
(3)
Once the job begins processing or transporting, it cannot be interrupted.
The parameters and decision variables are defined as follows:
J: set of jobs, J = {1, 2, …, n};
Ms: set of machines at stage s, Ms = {1, 2, …, ms};
j: index of a job, j = 1, 2, …, n;
i: index of the first-stage machine, i = 1, 2, …, m1;
k: index of the second-stage machine, k = 1, 2, …, m2;
psj: the processing time of job j at stage s;
tik: the transportation time of the job from machine i to machine k;
SPi: the blocking power of a job on machine i in the first stage per unit of time;
TPik: the transportation power of a job from machine i to machine k per unit of time;
M: a sufficiently large positive number;
Aj: the arrival time of job j;
Bsj: the start time of job j at stage s;
Csj: the completion time of job j at stage s;
L1j: the leave time of job j in the first stage;
π: the feasible overall scheduling solution;
tj(π): the transportation time of job j under the scheduling solution π;
wj(π): the waiting time of job j before processing in the first stage under the scheduling solution π;
bj(π): the blocking time of job j on the first-stage machine under the scheduling solution π;
Cmax: the makespan;
TEC: the total energy consumption;
Xij: it is equal to 1 if job j is processed on machine i; otherwise, it is equal to 0;
Yjk: it is equal to 1 if job j is processed on machine k; otherwise, it is equal to 0;
1. Makespan: The factors affecting the completion time of a job include processing time, transportation time, waiting processing time, and blocking time. The formulas are defined as follows:
$$\min f_1 = C_{\max} = \max(C_1, \ldots, C_j, \ldots, C_n) \quad (1)$$

$$C_j = p_{1j} + p_{2j} + t_j(\pi) + w_j(\pi) + b_j(\pi) \quad (2)$$

$$t_j(\pi) = \sum_{i=1}^{m_1}\sum_{k=1}^{m_2} X_{ij} Y_{jk} t_{ik} \quad (3)$$

$$w_j(\pi) = B_{1j} - A_j \quad (4)$$

$$b_j(\pi) = L_{1j} - C_{1j} \quad (5)$$
where Equation (1) represents the objective function to minimize the makespan. Equation (2) defines the completion time of job j as the sum of processing time, transportation time, waiting processing time, and blocking time. Equation (3) represents the transportation time of job j. Equation (4) defines the waiting processing time of job j before the first stage as the difference between its start processing time in the first stage and its arrival time. Equation (5) defines the blocking time of job j on the first-stage machine as the difference between its leave time on the first-stage machine and its completion time.
2. Total energy consumption: TEC includes blocking energy consumption (EC1), transportation energy consumption (EC2), and processing energy consumption (EC3). Notably, EC3 for each job is solely dependent on its processing time. Since each stage is equipped with identical parallel machines, EC3 is not affected by different processing sequences and remains constant. Therefore, Equation (6) shows that minimizing TEC requires minimizing only EC1 and EC2. The second objective function is as follows:

$$\min f_2 = TEC = EC_1 + EC_2 \quad (6)$$

$$EC_1 = \sum_{j=1}^{n}\sum_{i=1}^{m_1} (X_{ij} L_{1j} - X_{ij} C_{1j}) SP_i \quad (7)$$

$$EC_2 = \sum_{j=1}^{n}\sum_{i=1}^{m_1}\sum_{k=1}^{m_2} X_{ij} Y_{jk} t_{ik} TP_{ik} \quad (8)$$
where Equation (7) defines EC1 as the energy consumed while jobs are blocked on machines: the sum, over all jobs, of the product of each job’s blocking time and the blocking power of its machine. Equation (8) defines EC2 as the energy consumed by vehicles transporting jobs: the sum of the products of each job’s transportation time and the corresponding transportation power.
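To make the objective evaluation concrete, the following is a minimal Python sketch (ours, not the authors’ implementation) of computing Equations (1)–(8) for one decoded schedule; the Job fields are hypothetical names that bundle each job’s time components and the powers of its assigned machines.

```python
# A minimal sketch (not the authors' code) of evaluating the two objectives
# in Equations (1)-(8) for one decoded schedule. The Job fields below are
# hypothetical names bundling each job's times and assigned-machine powers.
from dataclasses import dataclass

@dataclass
class Job:
    p1: float       # stage-1 processing time p_{1j}
    p2: float       # stage-2 processing time p_{2j}
    t_trans: float  # transportation time t_j(pi) for its machine pair (i, k)
    wait: float     # waiting time w_j(pi) before stage 1
    block: float    # blocking time b_j(pi) on its stage-1 machine
    sp: float       # blocking power SP_i of its stage-1 machine
    tp: float       # transportation power TP_ik of its machine pair

def makespan(jobs):
    # Equations (1)-(2): C_j sums all five time components; f_1 is their max
    return max(j.p1 + j.p2 + j.t_trans + j.wait + j.block for j in jobs)

def total_energy(jobs):
    ec1 = sum(j.block * j.sp for j in jobs)    # Equation (7): blocking energy
    ec2 = sum(j.t_trans * j.tp for j in jobs)  # Equation (8): transport energy
    return ec1 + ec2                           # Equation (6)
```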
The following mathematical model is established based on the above problems:
$$\min \{C_{\max}, TEC\} \quad (9)$$

s.t.

$$\sum_{i=1}^{m_1} X_{ij} = 1, \quad \sum_{k=1}^{m_2} Y_{jk} = 1, \quad \forall j \in J \quad (10)$$

$$C_{1j} = B_{1j} + \sum_{i=1}^{m_1} p_{1j} X_{ij}, \quad \forall i \in M_1, j \in J \quad (11)$$

$$C_j = B_{2j} + \sum_{k=1}^{m_2} p_{2j} Y_{jk}, \quad \forall k \in M_2, j \in J \quad (12)$$

$$L_{1j} + t_j(\pi) = B_{2j}, \quad \forall j \in J \quad (13)$$

$$B_{1j'} \ge L_{1j} - M(2 - X_{ij} - X_{ij'}), \quad \forall j, j' \in J \quad (14)$$

$$C_{1j} \le L_{1j}, \quad \forall j \in J \quad (15)$$

$$B_{sj} \ge 0, \quad \forall j \in J \quad (16)$$

$$X_{ij} = \begin{cases} 1, & \text{if job } j \text{ is processed on machine } i \\ 0, & \text{otherwise} \end{cases} \quad \forall i \in M_1, j \in J \quad (17)$$

$$Y_{jk} = \begin{cases} 1, & \text{if job } j \text{ is processed on machine } k \\ 0, & \text{otherwise} \end{cases} \quad \forall k \in M_2, j \in J \quad (18)$$
where Equation (9) is the objective function and Equations (10)–(16) are the constraints. Equation (10) ensures that a job is processed by exactly one machine at each stage. Equations (11) and (12) define the completion time of a job as the sum of its start time and its processing time. Equations (13) and (14) are the blocking constraints: Equation (13) defines the start time of a job in the second stage as the sum of its leave time in the first stage and its transportation time, and Equation (14) requires that the start time of the next job j′ processed on the same machine be no earlier than the leave time of the previously processed job j. Equation (15) states that a job can leave a machine only after its operation is finished. Equation (16) requires the start time of a job to be non-negative. Equations (17) and (18) define the decision variables.

3. Adaptive Objective Selection Q-Learning Algorithm

The two-stage BHFSP model with transportation time established in Section 2 is a multi-objective mixed-integer programming model. HFSP has been proven to be NP-hard [43], and due to the additional complexity of the problem studied in this paper, it is also NP-hard. Reinforcement learning enables autonomous learning through interaction between the agent and the environment; it can adapt to diverse tasks and environments while improving continuously, giving it an advantage in intelligent decision-making and scheduling. In this section, an adaptive objective selection Q-learning algorithm (AQL) for solving the multi-objective scheduling problem is designed. The confidence of the two objective functions is computed using a t-test, allowing the algorithm to focus on optimizing the objective with the highest confidence.

3.1. Problem Transformation

3.1.1. State

The state features capture the environment of the blocking hybrid flow shop, including real-time information on machines, jobs, and the waiting processing queues before the two stages. f_{j,1} represents the state of job j; f_{i,2} represents the working state of machine i in the first stage; f_{k,3} represents the working state of machine k in the second stage; and f_{s,4}–f_9 represent the environmental features of the waiting processing queues, where Q_1 denotes the waiting processing queue before the first stage and Q_2 denotes the queue of jobs blocked on the first-stage machines. In total, the environment of the two-stage BHFSP with transportation time is described by n + m_1 + m_2 + 11 state features, defined as follows.
State 1 The five states of the job j.
$$f_{j,1} = \begin{cases} 0, & \text{waiting for first-stage processing} \\ 1, & \text{being processed in the first stage} \\ -1, & \text{blocked on a first-stage machine} \\ 1/2, & \text{being processed in the second stage} \\ -1/2, & \text{second-stage processing complete} \end{cases} \quad j = 1, 2, \ldots, n \quad (19)$$
State 2 The working state of machine i in the first stage.
$$f_{i,2} = \begin{cases} 0, & \text{machine } i \text{ is idle} \\ 1, & \text{machine } i \text{ is busy} \end{cases} \quad i = 1, 2, \ldots, m_1 \quad (20)$$
State 3 The working state of machine k in the second stage.
$$f_{k,3} = \begin{cases} 0, & \text{machine } k \text{ is idle} \\ 1, & \text{machine } k \text{ is busy} \end{cases} \quad k = 1, 2, \ldots, m_2 \quad (21)$$
State 4 The ratio of the number of all jobs in queue Qs to the total number of jobs.
$$f_{s,4} = \frac{\eta(Q_s)}{n}, \quad s = 1, 2 \quad (22)$$
State 5 The ratio of the average processing time of all jobs in queue Qs to the average processing time of the job on the machine at this stage.
$$f_{s,5} = \frac{\sum_{j \in Q_s} p_{sj} / \eta(Q_s)}{\sum_{j=1}^{n} p_{sj} / n}, \quad Q_s \neq \emptyset; \; s = 1, 2 \quad (23)$$
State 6 Whether the job with minimum processing time is in queue Qs.
$$f_{s,6} = \begin{cases} 0, & \text{the job with the minimum processing time is not in queue } Q_s \\ 1, & \text{otherwise} \end{cases} \quad s = 1, 2 \quad (24)$$
State 7 The ratio of the maximum processing time of a job in queue Qs to the maximum processing time of all jobs.
$$f_{s,7} = \frac{\max_{j \in Q_s}(p_{sj})}{\max_{j \in N}(p_{sj})}, \quad s = 1, 2 \quad (25)$$
State 8 The ratio of the minimum processing time of a job in queue Qs to the maximum processing time of all jobs.
$$f_{s,8} = \frac{\min_{j \in Q_s}(p_{sj})}{\max_{j \in N}(p_{sj})}, \quad s = 1, 2 \quad (26)$$
State 9 The ratio of the number of jobs in queue Q1, whose processing time in the first stage exceeds that in the second stage, to the number of jobs in queue Q1.
$$f_9 = \frac{\eta(J_{Q_1})}{\eta(Q_1)}, \quad J_{Q_1} = \{J_j \mid p_{1j} > p_{2j}, J_j \in Q_1\}; \; Q_1 \neq \emptyset \quad (27)$$
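As an illustration, the following Python sketch (our reading of the definitions above, not the authors’ code) assembles the n + m1 + m2 + 11 features into a single vector; all names are hypothetical, and the five job-state values follow Equation (19).

```python
import numpy as np

# A rough sketch of assembling the n + m1 + m2 + 11 state features of
# Section 3.1.1. job_status holds the five-valued job states of Eq. (19),
# machine_busy_1/2 the 0/1 machine states, Q1/Q2 the two queues (job index
# lists), and p is a (2, n) array of processing times p_{sj}.
def state_vector(job_status, machine_busy_1, machine_busy_2, Q1, Q2, p):
    n = p.shape[1]
    feats = list(job_status) + list(machine_busy_1) + list(machine_busy_2)
    for s, Q in enumerate((Q1, Q2)):
        feats.append(len(Q) / n)                        # State 4
        if Q:
            feats.append(p[s, Q].mean() / p[s].mean())  # State 5
            feats.append(float(p[s].argmin() in Q))     # State 6
            feats.append(p[s, Q].max() / p[s].max())    # State 7
            feats.append(p[s, Q].min() / p[s].max())    # State 8
        else:
            feats += [0.0, 0.0, 0.0, 0.0]               # empty-queue fallback
    # State 9: share of jobs in Q1 whose stage-1 time exceeds their stage-2 time
    feats.append(sum(p[0, j] > p[1, j] for j in Q1) / len(Q1) if Q1 else 0.0)
    return np.asarray(feats, dtype=float)
```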

3.1.2. Action

The actions are designed based on scheduling rules such as SPT, FCFS, Johnson, etc. These scheduling rules are primarily adopted to allocate waiting jobs to machines. When a machine is idle, a job can select it for processing; when a machine is busy, a job cannot choose that machine. If there are no available machines at a certain moment, the job can only be blocked at the current stage. Based on this, six actions are set in the first stage and four actions are set in the second stage, for a total of twenty-four joint actions.
  • The First Production Stage
Action 1 SPT: Process the jobs in queue Q1 in p1j ascending order, selecting the job with the shortest processing time.
Action 2 LPT: Process the jobs in queue Q1 in p1j descending order, selecting the job with the longest processing time.
Action 3 SPT + SSO: Process the jobs in queue Q1 in p1j + p2j ascending order, selecting the job with the shortest total processing time.
Action 4 LPT + LSO: Process the jobs in queue Q1 in p1j + p2j descending order, selecting the job with the longest total processing time.
Action 5 Johnson–Bellman: Divide the set of jobs in queue Q1 into two subsets, SJ1 and SJ2. SJ1 contains the set of jobs where p1j < p2j, and SJ2 contains the remaining jobs. Then, apply the SPT rule to select jobs from SJ1 and the LPT rule to select jobs from SJ2.
Action 6 Select no job: Select this action when there are no jobs in queue Q1 or all the machines in the first stage are busy.
  • The Second Production Stage
Action 7 SPT: Process the jobs in queue Q2 in p2j ascending order, selecting the job with the shortest processing time.
Action 8 LPT: Process the jobs in queue Q2 in p2j descending order, selecting the job with the longest processing time.
Action 9 FCFS: Process the jobs in queue Q2 in an ascending order of completion time, selecting the job that finishes first.
Action 10 Select no job: Select this action when there are no jobs in queue Q2 or all the machines in the second stage are busy.
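A compact sketch of how the first-stage rules (Actions 1–5) can be written as job-selection functions is given below; the function names and signatures are ours, assuming p1 and p2 map job indices to the processing times p1j and p2j.

```python
# Sketch of the stage-1 dispatch rules (Actions 1-5) as job-selection
# functions; names and signatures are ours. Q1 is the waiting queue (job
# indices), and p1/p2 map a job index to p_{1j} and p_{2j}.
def spt(Q1, p1, p2):       # Action 1: shortest stage-1 processing time
    return min(Q1, key=lambda j: p1[j])

def lpt(Q1, p1, p2):       # Action 2: longest stage-1 processing time
    return max(Q1, key=lambda j: p1[j])

def spt_sso(Q1, p1, p2):   # Action 3: shortest total processing time
    return min(Q1, key=lambda j: p1[j] + p2[j])

def lpt_lso(Q1, p1, p2):   # Action 4: longest total processing time
    return max(Q1, key=lambda j: p1[j] + p2[j])

def johnson_bellman(Q1, p1, p2):
    # Action 5: SPT over SJ1 = {j : p1j < p2j}; once SJ1 is empty, LPT over SJ2
    sj1 = [j for j in Q1 if p1[j] < p2[j]]
    if sj1:
        return min(sj1, key=lambda j: p1[j])
    return max(Q1, key=lambda j: p1[j])
```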

3.1.3. Reward

The reward function represents the immediate feedback received after performing an action in the current state and is usually related to the objective function. Therefore, rewards based on the makespan and the energy consumption are defined as follows, where r_t^1 denotes the reward for objective 1 obtained at decision moment t and r_t^2 denotes the reward for objective 2 obtained at decision moment t:
$$r_t^1 = f_1(t-1) - f_1(t) \quad (28)$$

$$r_t^2 = f_2(t-1) - f_2(t) \quad (29)$$
where f1(t) represents the makespan of the jobs processed so far at decision moment t and f2(t) represents the energy consumption already incurred by decision moment t. They are expressed as follows:
$$f_1(t) = \max\{C_{s_j(t)j} \mid j \in CJ(t)\} \quad (30)$$

$$f_2(t) = \sum_{j \in CJ(t)} \sum_{i \in M_1} (X_{ij} L_{1j} - X_{ij} C_{1j}) SP_i + \sum_{j \in CJ(t)} \sum_{i \in M_1} \sum_{k \in M_2} X_{ij} Y_{jk} t_{ik} TP_{ik} \quad (31)$$
where s_j(t) represents the number of operations completed on job j at decision moment t and CJ(t) represents the set of jobs processed at decision moment t.
Based on the rewards at each decision moment, the cumulative rewards obtained are as follows:
$$R_1 = \sum_{t=1}^{T} r_t^1 = \sum_{t=1}^{T} \left(f_1(t-1) - f_1(t)\right) = f_1(0) - f_1(T) = -C_{\max} \quad (32)$$

$$R_2 = \sum_{t=1}^{T} r_t^2 = \sum_{t=1}^{T} \left(f_2(t-1) - f_2(t)\right) = f_2(0) - f_2(T) = -TEC \quad (33)$$
where T represents the last decision moment, at which all the jobs have been processed, so f1(T) and f2(T) equal Cmax and TEC, respectively. Since no processing has occurred at the initial moment, f1(0) = f2(0) = 0; maximizing the cumulative rewards R1 and R2 is therefore equivalent to minimizing Cmax and TEC.

3.2. Value Function Approximation

The basic idea of the Q-learning algorithm is to guide the agent to make decisions by learning a Q-value function that maximizes the long-term cumulative reward. The Q-value function Q(s, a) represents the expected cumulative reward achievable by taking action a in state s. To simplify the problem and reduce computational complexity, a parameterized approximation is used: the state-value function is represented with basis functions and updated by adjusting their weights. The Q-function is approximated as follows:
$$Q(s, a) = \sum_{z=1}^{m_1+m_2+n+11} w_z^a \varphi_z(s) \quad (34)$$
where φ_z(s) is the z-th basis function over the state space and w_z^a is the weight of basis function z for action a. The normalization of the basis functions is shown in Equation (35).
$$\varphi_z(s) = \begin{cases} f_{z,1}, & 1 \le z \le n \\ f_{z,2}, & n+1 \le z \le n+m_1 \\ f_{z,3}, & n+m_1+1 \le z \le n+m_1+m_2 \\ f_{z,4}, & n+m_1+m_2+1 \le z \le n+m_1+m_2+2 \\ f_{z,5}, & n+m_1+m_2+3 \le z \le n+m_1+m_2+4 \\ f_{z,6}, & n+m_1+m_2+5 \le z \le n+m_1+m_2+6 \\ f_{z,7}, & n+m_1+m_2+7 \le z \le n+m_1+m_2+8 \\ f_{z,8}, & n+m_1+m_2+9 \le z \le n+m_1+m_2+10 \\ f_{z,9}, & z = n+m_1+m_2+11 \end{cases} \quad (35)$$
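A minimal sketch of this linear approximation, assuming one weight vector per joint action and objective (consistent with the two Q-tables initialized in Section 3.4), could look as follows; the class and method names are hypothetical.

```python
import numpy as np

# A minimal sketch of the linear value-function approximation in Equations
# (34)-(35): one weight vector w^a per (joint action, objective), with
# Q(s, a, o) computed as a dot product with the state features phi(s).
class LinearQ:
    def __init__(self, n_features, n_actions=24, n_objectives=2):
        # weights initialized to 1, matching Step 1.2 of Section 3.4
        self.w = np.ones((n_objectives, n_actions, n_features))

    def q(self, phi, a, o):
        # Equation (34): Q(s, a, o) = sum_z w_z^a * phi_z(s)
        return float(self.w[o, a] @ phi)

    def q_all(self, phi, o):
        # Q-values of all joint actions for objective o in state s
        return self.w[o] @ phi
```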

3.3. T-Test-Based Adaptive Objective Selection

Multi-objective reinforcement learning often employs a linear scalarization approach to address multiple objectives, with the primary challenge lying in determining the objective weights. In reinforcement learning, objective weights are typically globally fixed and do not adapt to the dynamically changing problem state space. To address this issue, we integrate objective weights with the problem state, representing the weights as functions of the state. By combining t-tests with confidence, we propose an adaptive objective selection strategy.
The basic idea of adaptive objective selection is the parallel estimation of the Q-function for each objective o. When action selection is required, a t-test is employed to calculate the confidence of each objective function in the current state, determining the objective in which the agent has the highest confidence. The Q-values of that objective are then used to make the action decision. Calculating confidence with a t-test reveals significant differences in the sample distributions, allowing for a more targeted objective selection and weight allocation. The specific steps of the algorithm are as follows:
Step 1: Select the x = 10 most recently observed values of r^o + max_a Q(s, a, o) and add them to the sample set SA_o.
Step 2: Calculate the confidence levels of each objective function using a t-test. The calculation formula is as follows:
$$t_o = \frac{\bar{x}_o - Q(s, a_v, o)}{g_o / \sqrt{x}}, \quad v \in [1, 24] \quad (36)$$

where x̄_o is the mean and g_o the standard deviation of the samples in SA_o. The p-value p_o corresponding to t_o is found from the t-distribution table, giving a confidence level of 1 − p_o.
Step 3: Put the confidence levels 1 − p_o of all objective functions at state s into the set c_o:

$$c_o = \text{confidence}((s, a_1, o), \ldots, (s, a_{24}, o)) \quad (37)$$
Step 4: Define μo(s) as the weight of the o-th objective function at state s. Select the objective function with the highest confidence level.
$$\mu_o(s) = \begin{cases} 1, & \text{if } o = \arg\max c_o \\ 0, & \text{otherwise} \end{cases} \quad (38)$$
Step 5: Select an action based on the objective function with the highest confidence level.
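The following sketch illustrates Steps 1–5 using scipy.stats.ttest_1samp. The paper does not fully specify how the per-action confidences are aggregated into c_o; this version takes, for each objective, the largest confidence (smallest p-value) over the 24 joint actions, which is an assumption on our part.

```python
import numpy as np
from scipy import stats

# Sketch of adaptive objective selection (Steps 1-5). samples[o] is the
# sample set SA_o: the x = 10 most recent observations of
# r^o + max_a Q(s, a, o). q_values[o] holds Q(s, a_v, o) for the 24 joint
# actions. Aggregating per-action confidences by their maximum is our
# assumption; the paper leaves this detail open.
def select_objective(samples, q_values):
    confidences = []
    for o, sa in enumerate(samples):
        # one-sample t-test of SA_o against each action's Q-value (Eq. (36));
        # the smallest p-value gives the confidence level 1 - p_o (Eq. (37))
        p_o = min(stats.ttest_1samp(sa, popmean=q).pvalue for q in q_values[o])
        confidences.append(1.0 - p_o)
    return int(np.argmax(confidences))  # Equation (38): mu_o(s) = 1 for this o
```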

3.4. Algorithm Framework

The visualization in Figure 2 shows the specific implementation process of the algorithm. The scheduling system is in the initial state s0 at the start of the processing time. At this point, all jobs are in the waiting processing queue Q1, and all machines are idle. Then, an action is selected based on the ε-greedy strategy, which involves selecting a job from queue Q1 and an idle machine from the first stage for processing until either the set of idle machines or the set of jobs in queue Q1 becomes empty. Following this, the blocking queue Q2 and the set of idle machines before the second stage are evaluated. If they exist, the machine and the job are selected based on the actions. The scheduling system reaches the termination state sT, where all processing queues are empty and all jobs have been handled, resulting in a scheduling solution.
The specific steps of the AQL algorithm are as follows:
Step 1: Initialize parameters.
Step 1.1: Input parameters of the scheduling problem: the number of jobs n, the number of machines in the first stage m1, the number of machines in the second stage m2, the processing time of each job in the two-stage machines psj, the transportation time between machines tik, the blocking power of the machine in the first stage SPi, and the transportation power of the transporter TPik.
Step 1.2: Input parameters of the Q-learning algorithm: learning rate α, discount factor γ, greedy factor ε, decay rate λ, two (m1 + m2 + n + 11)-dimensional vectors E(a) = (0, 0, …, 0)^T and w^a = (1, 1, …, 1)^T, and the maximum number of episodes max_episode, with the current iteration g = 1.
Step 2: Set the initial time t0 and initial state s0, and initialize two Q(s, a) tables.
Step 3: Utilize a t-test to calculate the confidence of each objective function in the current state and determine the objective o in which the agent has the highest confidence.
Step 4: Use the ε-greedy strategy: with probability ε, select an action at random; with probability 1 − ε, select the action with the highest Q-value from the Q-table.
Step 5: Confirm the state transition time, calculate the reward, and update the Q-table. The reward r(st, at, st+1) is obtained by taking action at to move from state st to st+1; the basis function weights w_z^a are then updated, and hence the Q-table. The update process is as follows:
$$w_z^a = w_z^a + \alpha \delta E \quad (39)$$

$$\delta = r_t^o(s_t, a_t, s_{t+1}) + \gamma \max Q(s_{t+1}, a_{t+1}, o) - Q(s_t, a_t, o) \quad (40)$$

$$E = \lambda E(a_t) + \nabla_{w_z^a} Q(s_t, a_t, o) \quad (41)$$
Step 6: If the number of jobs completed in the second stage is less than n, return to Step 3; otherwise, execute Step 7.
Step 7: If the current iteration number g < max_episode, set g = g + 1 and return to Step 2; otherwise, terminate the algorithm.
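For the Step 5 update, a compact sketch follows (names are ours). With the linear approximator of Equation (34), the gradient of Q(st, at, o) with respect to w^a is simply φ(st), which is what the eligibility trace in Equation (41) accumulates.

```python
import numpy as np

# Sketch of the Step 5 update for the selected objective o (Equations
# (39)-(41)). w has shape (objectives, actions, features), e is the matching
# eligibility trace, and phi is the feature vector of the visited state.
def td_update(w, e, phi, a, o, reward, q_next_max, q_now,
              alpha=0.1, gamma=0.99, lam=0.1):
    delta = reward + gamma * q_next_max - q_now  # Equation (40): TD error
    e[o, a] = lam * e[o, a] + phi                # Equation (41): grad_w Q = phi
    w[o, a] += alpha * delta * e[o, a]           # Equation (39): weight update
    return w, e
```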

4. Numerical Experiments

4.1. Experimental Environment and Parameter Setting

To validate the effectiveness of the Q-learning algorithm, we designed the following instances for simulation analysis. The experiments are carried out on an Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz with 20 GB RAM, using the PyCharm 2017.3.2 IDE and the Python 3.7 interpreter.
The parameter settings are as follows: n = 15, m1 = 2, m2 = 3; psj and tik are generated at random from [1, 50], and SPi and TPik are generated at random from [1, 10].
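A minimal sketch of generating such a random test instance (hypothetical variable names; NumPy assumed) is:

```python
import numpy as np

# Sketch of random instance generation with the stated ranges: processing
# and transportation times uniform in [1, 50], powers uniform in [1, 10].
rng = np.random.default_rng(seed=0)      # seed is our choice
n, m1, m2 = 15, 2, 3
p = rng.integers(1, 51, size=(2, n))     # processing times p_{sj}
t = rng.integers(1, 51, size=(m1, m2))   # transportation times t_{ik}
sp = rng.integers(1, 11, size=m1)        # blocking powers SP_i
tp = rng.integers(1, 11, size=(m1, m2))  # transportation powers TP_{ik}
```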
The initial parameter settings for the AQL algorithm, including α, γ, ε, and λ, are shown in Table 1. The candidate parameter combinations are evaluated using an orthogonal experiment designed according to the L9(3^4) array.
Table 2 shows the values of the two objective functions for nine different parameter selections. The performance of the proposed model is evaluated using the Normalized Performance (NP). A smaller NP indicates better performance. The definition of NP is as follows:
$$NP = \frac{C_{\max} - \min MC}{\max MC - \min MC} + \frac{TEC - \min MT}{\max MT - \min MT} \quad (42)$$
In Equation (42), MC represents the set of all Cmax values and MT represents the set of all TEC values.
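As a sketch, Equation (42) can be computed as follows; all_cmax and all_tec collect the Cmax and TEC values of the nine runs in Table 2.

```python
# Sketch of the Normalized Performance metric of Equation (42): min-max
# normalize each objective over the set of runs and sum the two terms.
def normalized_performance(cmax, tec, all_cmax, all_tec):
    return ((cmax - min(all_cmax)) / (max(all_cmax) - min(all_cmax))
            + (tec - min(all_tec)) / (max(all_tec) - min(all_tec)))

# Run 5 of Table 2 (Cmax = 264, TEC = 1380) evaluates to approximately 0.18.
```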
Summarizing the results from Table 2: for example, the K1 value for α is the sum of the NP values of all runs with α = 0.001. The summarized results are shown in Table 3.
From Table 3, the minimum K values are 2.59 for α, 2.56 for γ, 1.54 for ε, and 2.49 for λ. The final parameter values are therefore α = 0.1, γ = 0.99, ε = 0.2, and λ = 0.1.
Figure 3 shows the convergence of the objective value when the algorithm iterates up to 1000 generations under the above experimental parameters. It can be seen that the algorithm tends to converge around the 200th iteration. Hence, for this experiment, a maximum of 200 iterations is chosen.

4.2. Experimental Results and Analysis

To validate the effectiveness of the model and the algorithm, experiments are conducted with m1 = 2 and m2 = 3 machines and with the number of jobs ranging from n = 4 to n = 8. As shown in Table 4, the two objective function values for this problem are obtained using the Gurobi solver and the AQL algorithm. Notably, a computation time limit of 1800 s is set for the Gurobi solver.
As we can see, when n = 4, Gurobi finds the optimal solution in less than 1 s. For n = 7, it consumes significantly more computation time, close to 1195 s. For n = 8, the computation time reaches the limit, and an optimal solution cannot be obtained. The proposed AQL algorithm performs less effectively than the Gurobi solver on the first objective value. However, as the problem scale increases, the performance gap between AQL and Gurobi narrows. More importantly, AQL requires far less computation time than the Gurobi solver. Therefore, as the instance scale increases, AQL can provide high-quality solutions in a much shorter time than the Gurobi solver.
Experiments are conducted with n = 30 under three machine configurations: m1 = 3, m2 = 5; m1 = 5, m2 = 5; and m1 = 7, m2 = 5. Table 5 shows the scheduling solutions obtained by the AQL algorithm for these three experimental scales. Figure 4, Figure 5 and Figure 6 show the corresponding scheduling Gantt charts, where the black areas represent the blocked portions.
Figure 4, Figure 5 and Figure 6 show that the performance of the AQL algorithm is influenced by different production configurations. As the number of machines increases in the first stage, the blocking time also increases, leading to higher TEC. Therefore, in real-world production environments, it is possible to reduce the risk of job blocking and improve the efficiency and stability of production by designing the layout of the production line.

4.3. Experimental Comparison

To further validate the effectiveness of the algorithm, the performance of the AQL algorithm is compared with individual scheduling rules at different experimental scales. Furthermore, AQL is compared with Q-learning and NSGA-II, where the Q-learning baseline linearly weights the multiple objective functions into a single reward for solving the problem.
Table 6, Table 7 and Table 8 show the comparative results of the two objective function values obtained by the AQL algorithm and individual scheduling rules at different machine scales, where n is set to 15, 30, 50, and 100, respectively. Figure 7 shows the comparison graph of the frequency of selecting different actions under different machine scales when AQL solves n = 15.
From Table 6, Table 7 and Table 8, it is evident that in 92% of the test instances, AQL consistently achieves lower makespan and TEC. Compared to individual heuristic rules, AQL shows an average improvement in Cmax values ranging from a maximum of 21.2% to a minimum of 7.4%. Similarly, for TEC values, AQL demonstrates an average improvement ranging from a maximum of 37.4% to a minimum of 13.5%. This indicates that AQL can consistently find scheduling rules that result in better objective values at different scales. From the results with the superscript (*), it is apparent that the worst outcomes are evenly distributed across rules other than the SPT + SSO in the first stage. That is, none of the results obtained under the SPT + SSO rule are the worst. Combined with Figure 7, we can see that AQL selects the SPT + SSO rule significantly more often than other scheduling rules. It shows that AQL can find the scheduling rule that makes the objective value better at each decision point.
To validate the advantages of the AQL algorithm in solving multi-objective problems, the performance of AQL is compared with NSGA-II and the Q-learning algorithm, respectively. Table 9 shows the comparative results of the objective function values for each algorithm at different scales.
Table 9 presents the experimental results comparing AQL with the NSGA-II and Q-learning algorithms. The results indicate that, compared to the Q-learning algorithm, the AQL algorithm achieves lower Cmax values in 66.7% of the test instances when n = 15 and n = 30. As the number of jobs increases, both objective function values under the AQL algorithm outperform those of the Q-learning algorithm. This suggests that linearly weighting multiple objective functions as rewards is subjective. Compared to NSGA-II, the Cmax values improved by an average of 22.7%, and the TEC values improved by an average of 9.8%. In summary, the AQL algorithm significantly outperforms both Q-learning and NSGA-II in solving multi-objective problems, demonstrating its superiority.

5. Conclusions

This paper, set against the backdrop of a steel manufacturing enterprise, focuses on its production processes and material transportation. Owing to the stringent temperature requirements of materials, we investigated the two-stage BHFSP with transportation times and formulated a multi-objective scheduling model minimizing both the makespan and the total energy consumption. We designed the AQL algorithm to solve the model. Nine state features were designed based on real-time information about jobs, machines, and waiting processing queues in the blocking hybrid flow shop environment. Ten actions were formulated based on heuristic rules such as SPT, FCFS, and Johnson. We proposed an adaptive objective selection strategy based on t-tests, wherein the algorithm calculates confidence levels to determine the most confident objective for the current action selection, without relying on fixed objective weights. Simulation analyses were conducted at different experimental scales, comparing single scheduling rules, the Q-learning algorithm, and NSGA-II. The experimental results demonstrate that the AQL algorithm achieves the best scheduling solutions in 92%, 83.3%, and 91.7% of the test instances, respectively. This research helps to optimize production and transportation processes in process industries, reducing the impact of blocking and transportation time on completion time and improving resource utilization. Additionally, this approach allows enterprises to consume less energy on blocking and transportation, which is consistent with the green manufacturing direction of modern production.
The problem studied in this paper does not consider the number and capacity limitations of transportation vehicles. Future research can explore the coordination of production and transportation scheduling problems in multi-processing stage blocking hybrid flow shop environments when transportation resources are constrained.

Author Contributions

Conceptualization, K.X.; methodology, C.Y. and K.X.; software, C.Y. and W.S.; validation, K.X. and H.G.; formal analysis, H.G. and W.S.; resources, W.S.; data curation, C.Y.; writing—original draft preparation, C.Y.; writing—review and editing, K.X.; supervision, K.X.; project administration, H.G.; funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Project of Liaoning BaiQianWan Talents Program under Grant No. 2021921089, the Science Research Foundation of Educational Department of Liaoning Province under Grant Nos. LJKQZ2021057 and LJKZ2060, and the Liaoning Province Xingliao Talents Plan project under Grant No. XLYC2006017.

Data Availability Statement

All data from the experiments are included in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, Q.; Liu, C.; Chu, H.; Liu, Z.; Zhang, W.; Pan, J. A New Multi-Objective Hybrid Flow Shop Scheduling Method to Fully Utilize the Residual Forging Heat. IEEE Access 2020, 8, 151180–151194. [Google Scholar] [CrossRef]
  2. Wardono, B.; Fathi, Y. A tabu search algorithm for the multi-stage parallel machine problem with limited buffer capacities. Eur. J. Oper. Res. 2004, 155, 380–401. [Google Scholar] [CrossRef]
  3. Du, S.; Zhou, W.; Wu, D.; Fei, M. An effective discrete monarch butterfly optimization algorithm for distributed blocking flow shop scheduling with an assembly machine. Expert Syst. Appl. 2023, 225, 120113. [Google Scholar] [CrossRef]
  4. Miyata, H.H.; Nagano, M.S.; Gupta, J.N.D. Solutions methods for m-machine blocking flow shop with setup times and preventive maintenance costs to minimise hierarchical objective-function. Int. J. Prod. Res. 2023, 61, 6308–6335. [Google Scholar] [CrossRef]
  5. Cheng, C.-Y.; Pourhejazy, P.; Ying, K.-C.; Huang, S.-Y. New benchmark algorithm for minimizing total completion time in blocking flowshops with sequence-dependent setup times. Appl. Soft Comput. 2021, 104, 107229. [Google Scholar] [CrossRef]
  6. Zhao, F.; Shao, D.; Wang, L.; Xu, T.; Zhu, N.; Jonrinaldi. An effective water wave optimization algorithm with problem-specific knowledge for the distributed assembly blocking flow-shop scheduling problem. Knowl.-Based Syst. 2022, 243, 108471. [Google Scholar] [CrossRef]
  7. Niu, W.; Li, J. A two-stage cooperative evolutionary algorithm for energy-efficient distributed group blocking flow shop with setup carryover in precast systems. Knowl.-Based Syst. 2022, 257, 109890. [Google Scholar] [CrossRef]
  8. Zhao, F.; Xu, Z.; Bao, H.; Xu, T.; Zhu, N.; Jonrinaldi. A cooperative whale optimization algorithm for energy-efficient scheduling of the distributed blocking flow-shop with sequence-dependent setup time. Comput. Ind. Eng. 2023, 178, 109082. [Google Scholar] [CrossRef]
  9. Bao, H.; Pan, Q.; Ruiz, R.; Gao, L. A collaborative iterated greedy algorithm with reinforcement learning for energy-aware distributed blocking flow-shop scheduling. Swarm Evol. Comput. 2023, 83, 101399. [Google Scholar] [CrossRef]
  10. Nagano, M.; Takano, M.; Robazzi, J. A branch and bound method in a permutation flow shop with blocking and setup times. Int. J. Ind. Eng. Comput. 2022, 13, 255–266. [Google Scholar] [CrossRef]
  11. Wang, Y.; Wang, Y.; Han, Y. A Variant Iterated Greedy Algorithm Integrating Multiple Decoding Rules for Hybrid Blocking Flow Shop Scheduling Problem. Mathematics 2023, 11, 2453. [Google Scholar] [CrossRef]
  12. Qin, H.-X.; Han, Y.-Y.; Zhang, B.; Meng, L.-L.; Liu, Y.-P.; Pan, Q.-K.; Gong, D.-W. An improved iterated greedy algorithm for the energy-efficient blocking hybrid flow shop scheduling problem. Swarm Evol. Comput. 2022, 69, 100992. [Google Scholar] [CrossRef]
  13. Shao, Z.; Shao, W.; Pi, D. LS-HH: A learning-based selection hyper-heuristic for distributed heterogeneous hybrid blocking flow-shop scheduling. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 111–127. [Google Scholar] [CrossRef]
  14. Missaoui, A.; Boujelbene, Y. An effective iterated greedy algorithm for blocking hybrid flow shop problem with due date window. RAIRO-Oper. Res. 2021, 55, 1603–1616. [Google Scholar] [CrossRef]
  15. Aqil, S.; Allali, K. Two efficient nature inspired meta-heuristics solving blocking hybrid flow shop manufacturing problem. Eng. Appl. Artif. Intell. 2021, 100, 104196. [Google Scholar] [CrossRef]
  16. Qin, H.-X.; Han, Y.-Y.; Chen, Q.-D.; Li, J.-Q.; Sang, H.-Y. A double level mutation iterated greedy algorithm for blocking hybrid flow shop scheduling. Control Decis. 2022, 37, 2323–2332. [Google Scholar]
  17. Zhao, F.-Q.; Du, S.-L.; Cao, J.; Tang, J.-X. Study on distributed assembly blocking flow shop scheduling algorithm. J. Huazhong Univ. Sci. Technol. (Nat. Sci. Ed.) 2022, 50, 138–142+148. [Google Scholar]
  18. Wang, Y.; Jia, Z.; Zhang, X. A hybrid meta-heuristic for the flexible flow shop scheduling with blocking. Swarm Evol. Comput. 2022, 75, 101195. [Google Scholar] [CrossRef]
  19. Feng, Y.; Kong, J. Multi-Objective Hybrid Flow-Shop Scheduling in Parallel Sequential Mode While Considering Handling Time and Setup Time. Appl. Sci. 2023, 13, 3563. [Google Scholar] [CrossRef]
  20. Lei, D.; Su, B. A multi-class teaching–learning-based optimization for multi-objective distributed hybrid flow shop scheduling. Knowl.-Based Syst. 2023, 263, 110252. [Google Scholar] [CrossRef]
  21. Geng, K.; Wu, S.; Liu, L. Multi-objective re-entrant hybrid flow shop scheduling problem considering fuzzy processing time and delivery time. J. Intell. Fuzzy Syst. 2022, 43, 7877–7890. [Google Scholar] [CrossRef]
  22. Wu, X.; Cao, Z. An improved multi-objective evolutionary algorithm based on decomposition for solving re-entrant hybrid flow shop scheduling problem with batch processing machines. Comput. Ind. Eng. 2022, 169, 108236. [Google Scholar] [CrossRef]
  23. Wang, J.; Wang, L.; Cai, J.; Li, J.; Su, X. Solution Algorithm of Multi-objective Hybrid Flow Shop Scheduling Problem. J. Nanjing Univ. Aeronaut. Astronaut. 2023, 55, 544–552. [Google Scholar]
  24. Song, C. Improved NSGA-II algorithm for hybrid flow shop scheduling problem with multi-objective. Comput. Integr. Manuf. Syst. 2022, 28, 1777–1789. [Google Scholar]
  25. Lei, D.-M.; Wang, T. An improved shuffled frog leaping algorithm for the distributed two-stage hybrid flow shop scheduling. Control Decis. 2021, 36, 241–248. [Google Scholar]
  26. Song, C. A hybrid multi-objective teaching-learning based optimization for scheduling problem of hybrid flow shop with unrelated parallel machine. IEEE Access 2021, 9, 56822–56835. [Google Scholar] [CrossRef]
  27. Li, P.; Xue, Q.; Zhang, Z.; Chen, Z.; Zhou, D. Multi-objective energy-efficient hybrid flow shop scheduling using Q-learning and GVNS driven NSGA-II. Comput. Oper. Res. 2023, 159, 106360. [Google Scholar] [CrossRef]
  28. Wang, Y.; Wang, S.; Li, D.; Shen, C.; Yang, B. An improved multi-objective whale optimization algorithm for the hybrid flow shop scheduling problem considering device dynamic reconfiguration processes. Expert Syst. Appl. 2021, 174, 114793. [Google Scholar]
  29. Cui, H.; Li, X.; Gao, L.; Zhang, C. Multi-population genetic algorithm with greedy job insertion inter-factory neighbourhoods for multi-objective distributed hybrid flow-shop scheduling with unrelated-parallel machines considering tardiness. Int. J. Prod. Res. 2023, 1–19. [Google Scholar] [CrossRef]
  30. Wang, J.; Li, X.; Zhu, X. Intelligent dynamic control of stochastic economic lot scheduling by agent-based reinforcement learning. Int. J. Prod. Res. 2012, 50, 4381–4395. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Zheng, L.; Li, N.; Wang, W.; Zhong, S.; Hu, K. Minimizing mean weighted tardiness in unrelated parallel machine scheduling with reinforcement learning. Comput. Oper. Res. 2012, 39, 1315–1324. [Google Scholar] [CrossRef]
  32. Lee, J.-H.; Kim, H.-J. Reinforcement learning for robotic flow shop scheduling with processing time variations. Int. J. Prod. Res. 2022, 60, 2346–2368. [Google Scholar] [CrossRef]
  33. Zhao, F.; Zhang, L.; Cao, J.; Tang, J. A cooperative water wave optimization algorithm with reinforcement learning for the distributed assembly no-idle flowshop scheduling problem. Comput. Ind. Eng. 2021, 153, 107082. [Google Scholar] [CrossRef]
  34. Zhang, C.; Song, W.; Cao, Z.; Zhang, J.; Tan, P.S.; Chi, X. Learning to dispatch for job shop scheduling via deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1621–1632. [Google Scholar]
  35. Li, Z.; Wei, X.; Jiang, X.; Pang, Y. A kind of reinforcement learning to improve genetic algorithm for multiagent task scheduling. Math. Probl. Eng. 2021, 2021, 1796296. [Google Scholar] [CrossRef]
  36. Luo, S. Dynamic scheduling for flexible job shop with new job insertions by deep reinforcement learning. Appl. Soft Comput. 2020, 91, 106208. [Google Scholar] [CrossRef]
  37. Zhang, J.; Cai, J. A Dual-Population Genetic Algorithm with Q-Learning for Multi-Objective Distributed Hybrid Flow Shop Scheduling Problem. Symmetry 2023, 15, 836. [Google Scholar] [CrossRef]
  38. Cheng, L.; Tang, Q.; Zhang, L.; Zhang, Z. Multi-objective Q-learning-based hyper-heuristic with Bi-criteria selection for energy-aware mixed shop scheduling. Swarm Evol. Comput. 2022, 69, 100985. [Google Scholar] [CrossRef]
  39. Chang, J.; Yu, D.; Zhou, Z.; He, W.; Zhang, L. Hierarchical Reinforcement Learning for Multi-Objective Real-Time Flexible Scheduling in a Smart Shop Floor. Machines 2022, 10, 1195. [Google Scholar] [CrossRef]
  40. Li, R.; Gong, W.; Lu, C. A reinforcement learning based RMOEA/D for bi-objective fuzzy flexible job shop scheduling. Expert Syst. Appl. 2022, 203, 117380. [Google Scholar] [CrossRef]
  41. Yuan, J.-L.; Chen, M.-C.; Jiang, T.; Li, C. Multi-objective reinforcement learning job scheduling method using AHP fixed weight in heterogeneous cloud environment. Control Decis. 2022, 37, 379–386. [Google Scholar]
  42. Wu, X.; Yan, X. An Improved Q Learning Algorithm to Optimize Green Dynamic Scheduling Problem in a Reentrant Hybrid Flow Shop. J. Mech. Eng. 2022, 58, 246–259. [Google Scholar]
  43. Wang, M.Y.; Sethi, S.P.; van de Velde, S.L. Minimizing makespan in a class of reentrant shops. Oper. Res. 1997, 45, 702–712. [Google Scholar] [CrossRef]
Figure 1. Process flowchart for the heating–rolling stage.
Figure 2. The specific implementation procedure of the algorithm.
Figure 3. Graph of AQL algorithm convergence.
Figure 4. Gantt chart of optimal scheduling for m1 = 3, m2 = 5.
Figure 5. Gantt chart of optimal scheduling for m1 = 5, m2 = 5.
Figure 6. Gantt chart of optimal scheduling for m1 = 7, m2 = 5.
Figure 7. Comparison chart of action selection frequency.
Table 1. The initial parameter level table.

Level   α       γ       ε       λ
K1      0.001   0.1     0.01    0.1
K2      0.1     0.9     0.1     0.5
K3      0.9     0.99    0.2     0.9
Table 2. Orthogonal experimental results.

No.   α      γ     ε     λ     Cmax   TEC    NP
1     0.001  0.1   0.01  0.1   276    1548   1.84
2     0.001  0.9   0.1   0.5   268    1396   0.56
3     0.001  0.99  0.2   0.9   268    1396   0.56
4     0.1    0.1   0.1   0.9   266    1502   0.87
5     0.1    0.9   0.2   0.1   264    1380   0.18
6     0.1    0.99  0.01  0.5   271    1566   1.54
7     0.9    0.1   0.2   0.5   263    1538   0.80
8     0.9    0.9   0.01  0.9   276    1584   2.00
9     0.9    0.99  0.1   0.1   269    1356   0.46
Table 3. NP values of parameters.

Level     α      γ      ε      λ
K1        2.96   3.51   5.38   2.49
K2        2.59   2.74   1.89   2.89
K3        3.26   2.56   1.54   3.43
Optimal   0.1    0.99   0.2    0.1

Note: The bolded values in the table represent the best results.
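The relationship between Tables 2 and 3 can be made explicit: for each parameter, the K1–K3 entries of Table 3 are the sums of the NP values of the three orthogonal runs that used the corresponding level, and the level with the smallest sum is selected as optimal. A minimal sketch (not the authors' code) that reproduces this from the Table 2 data:

```python
# Sketch (not the authors' code): derive Table 3 from Table 2 by summing NP
# per parameter level; the level with the smallest sum is taken as optimal.
from collections import defaultdict

# Table 2 rows: ((alpha, gamma, epsilon, lambda), NP)
runs = [
    ((0.001, 0.10, 0.01, 0.1), 1.84), ((0.001, 0.90, 0.10, 0.5), 0.56),
    ((0.001, 0.99, 0.20, 0.9), 0.56), ((0.1,   0.10, 0.10, 0.9), 0.87),
    ((0.1,   0.90, 0.20, 0.1), 0.18), ((0.1,   0.99, 0.01, 0.5), 1.54),
    ((0.9,   0.10, 0.20, 0.5), 0.80), ((0.9,   0.90, 0.01, 0.9), 2.00),
    ((0.9,   0.99, 0.10, 0.1), 0.46),
]

for i, name in enumerate(["alpha", "gamma", "epsilon", "lambda"]):
    totals = defaultdict(float)
    for levels, np_val in runs:
        totals[levels[i]] += np_val            # K1..K3 of Table 3
    best = min(totals, key=totals.get)         # smaller NP is better
    print(f"{name}: {dict(totals)} -> optimal level {best}")
# Tiny mismatches with Table 3 (e.g., 2.48 vs. 2.49 for lambda) stem from
# rounding in the reported NP values.
```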
Table 4. Objective function values and CPU times for small-scale instances.

         Gurobi                  AQL
Job      Cmax   TEC   T/s        Cmax   TEC   T/s
n = 4    97     354   0.78       118    354   0.57
n = 5    110    540   5.22       122    440   1.74
n = 6    122    560   60         129    474   1.78
n = 7    145    644   1195       149    542   2.60
n = 8    —      —     1800       170    696   3.62

Note: The bolded values in the table represent the best results. A dash (—) indicates that no feasible solution was obtained within the time limit.
Table 5. Scheduling solutions and objective values.

m1 = 3, m2 = 5 (Cmax = 325, TEC = 2218)
  Stage 1: [26, 10, 19, 17, 12, 30, 3, 22, 4, 8, 29], [25, 23, 5, 27, 21, 20, 7, 13, 9, 18], [2, 15, 16, 24, 11, 6, 28, 14, 1]
  Stage 2: [26, 27, 21, 20, 7, 22, 9, 18], [25, 15, 24, 30, 3, 4, 8, 29], [2, 17, 6, 28, 1], [23, 5, 19, 12, 13], [10, 16, 11, 14]

m1 = 5, m2 = 5 (Cmax = 247, TEC = 2838)
  Stage 1: [23, 5, 27, 24, 13, 29], [26, 6, 11, 17, 30, 14, 9], [25, 12, 21, 28, 8], [2, 15, 16, 7, 1], [10, 19, 22, 20, 3, 4, 18]
  Stage 2: [26, 15, 16, 7, 1, 18], [23, 5, 6, 22, 21, 3, 8], [25, 19, 20, 13], [10, 12, 27, 30, 28, 14, 9], [2, 11, 17, 24, 4, 29]

m1 = 7, m2 = 5 (Cmax = 237, TEC = 3515)
  Stage 1: [26, 23, 16, 7, 4], [25, 5, 29, 20, 8], [2, 3, 17, 28], [10, 19, 21, 13], [11, 9, 12, 18], [6, 27, 30, 1], [22, 15, 24, 14]
  Stage 2: [26, 23, 19, 12, 30], [25, 6, 3, 17, 7, 18, 1], [10, 27, 24, 14], [2, 22, 9, 15, 20, 28, 13], [11, 5, 29, 16, 21, 8, 4]
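A quick consistency check on the encoding of Table 5 (an illustrative sketch, not part of the paper): each solution lists one job sequence per machine at each stage, so every one of the n = 30 jobs must appear exactly once per stage.

```python
# Illustrative check, not part of the paper: every job 1..n must occur exactly
# once across the machine sequences of a stage.
from itertools import chain

def stage_is_valid(stage_sequences, n_jobs):
    jobs = list(chain.from_iterable(stage_sequences))
    return sorted(jobs) == list(range(1, n_jobs + 1))

# Stage-1 sequences of the m1 = 3, m2 = 5 solution in Table 5.
stage1 = [
    [26, 10, 19, 17, 12, 30, 3, 22, 4, 8, 29],
    [25, 23, 5, 27, 21, 20, 7, 13, 9, 18],
    [2, 15, 16, 24, 11, 6, 28, 14, 1],
]
print(stage_is_valid(stage1, 30))  # True
```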
Table 6. Comparison of objective functions of different algorithms for m1 = 3, m2 = 5.

                          n = 15         n = 30         n = 50         n = 100
No.   Rule                Cmax    TEC    Cmax    TEC    Cmax    TEC    Cmax    TEC
R1    SPT-SPT             231 *   1843   388     3461   615     5551   1236    11,585
R2    SPT-LPT             231 *   1843   374     3584   629     5379   1196    11,402
R3    SPT-FCFS            231 *   1843   388     3461   629     5379   1207    11,789 *
R4    LPT-SPT             213     1791   390 *   3380   656 *   6205   1185    10,028
R5    LPT-LPT             214     1923   376     3289   650     5520   1235    11,638
R6    LPT-FCFS            213     1791   384     3617   651     6212   1250 *  11,557
R7    SPT + SSO-SPT       214     1696   363     2398   586     5962   1181    11,277
R8    SPT + SSO-LPT       214     1696   363     2398   577     5647   1178    10,478
R9    SPT + SSO-FCFS      214     1696   363     2398   586     5962   1178    10,466
R10   LPT + LSO-SPT       196     2097 * 356     2979   553     5611   1165    11,329
R11   LPT + LSO-LPT       196     2097 * 344     3402   574     5780   1141    10,670
R12   LPT + LSO-FCFS      196     2097 * 356     2979   557     5701   1132    11,029
R13   Johnson-SPT         194     1780   364     3653 * 575     5730   1186    10,041
R14   Johnson-LPT         194     1780   364     3653 * 591     6418   1183    11,526
R15   Johnson-FCFS        194     1780   364     3653 * 577     6492 * 1204    11,083
R16   AQL                 159     954    325     2218   511     4059   1098    8483

Note: The bolded values in the table represent the best results, and the values marked with (*) represent the worst results.
Table 7. Comparison of objective functions of different algorithms for m1 = 5, m2 = 5.

                          n = 15         n = 30         n = 50         n = 100
No.   Rule                Cmax    TEC    Cmax    TEC    Cmax    TEC    Cmax    TEC
R1    SPT-SPT             222     2248   317     4367   506     8006   1014    16,707
R2    SPT-LPT             222     2248   286     4215   542     8552   1026    17,178
R3    SPT-FCFS            222     2248   317     4367   541     8332   996     15,915
R4    LPT-SPT             220     2837   323     4124   553     8987   1020    17,465
R5    LPT-LPT             216     2884   327 *   4767 * 547     8913   1031    17,606
R6    LPT-FCFS            224 *   3364 * 322     4648   567 *   9078   1042 *  17,655
R7    SPT + SSO-SPT       180     1930   309     3952   514     7996   1025    16,060
R8    SPT + SSO-LPT       180     1930   283     3291   525     8634   970     15,731
R9    SPT + SSO-FCFS      180     1930   309     3952   526     8392   1018    15,741
R10   LPT + LSO-SPT       186     2665   311     4208   513     8758   1018    17,777
R11   LPT + LSO-LPT       186     2665   303     4276   523     8900   1000    16,819
R12   LPT + LSO-FCFS      186     2665   287     3894   520     9389 * 994     17,155
R13   Johnson-SPT         168     2457   283     4043   470     8022   1026    18,701 *
R14   Johnson-LPT         168     2457   291     4496   491     9242   1026    18,555
R15   Johnson-FCFS        168     2457   291     3957   491     9242   1000    17,756
R16   AQL                 151     1653   247     2838   440     7327   936     14,899

Note: The bolded values in the table represent the best results, and the values marked with (*) represent the worst results.
Table 8. Comparison of objective functions of different algorithms for m1 = 7, m2 = 5.

                          n = 15         n = 30         n = 50           n = 100
No.   Rule                Cmax    TEC    Cmax    TEC    Cmax    TEC      Cmax    TEC
R1    SPT-SPT             218 *   3119   280     5624   496     10,688   1032    23,859
R2    SPT-LPT             218 *   3119   261     5464   486     10,571   1014    24,350
R3    SPT-FCFS            218 *   3119   280     5624   510     11,360   1005    22,438
R4    LPT-SPT             212     3501   307     5256   559 *   12,694 * 1077 *  25,128
R5    LPT-LPT             212     3458   307     5914   551     11,466   1048    23,694
R6    LPT-FCFS            212     3501   313     5737   551     11,466   1067    23,831
R7    SPT + SSO-SPT       179     2382   258     3945   475     9553     1061    24,597
R8    SPT + SSO-LPT       202     2689   296     5217   499     10,796   1048    25,348
R9    SPT + SSO-FCFS      202     2689   260     4155   457     8329     1025    23,356
R10   LPT + LSO-SPT       190     3729   312     6512   521     12,498   1049    26,380 *
R11   LPT + LSO-LPT       190     3729   327 *   7087   510     11,741   1075    25,628
R12   LPT + LSO-FCFS      190     3729   327 *   7601 * 510     11,741   1050    25,382
R13   Johnson-SPT         190     3803   271     5192   481     10,031   1008    25,046
R14   Johnson-LPT         201     3905 * 286     5707   492     11,569   994     22,842
R15   Johnson-FCFS        201     3905 * 271     5192   482     11,593   991     22,391
R16   AQL                 177     1937   237     3515   455     9238     872     19,876

Note: The bolded values in the table represent the best results, and the values marked with (*) represent the worst results.
Table 9. Comparison of objective functions for different multi-objective algorithms.

                                  NSGA-II         Q-Learning      AQL
No.   Job       Machine           Cmax    TEC     Cmax    TEC     Cmax   TEC
R1    n = 15    m1 = 3, m2 = 5    221     1330    161     1221    159    954
R2    n = 15    m1 = 5, m2 = 5    165     2070    151     1661    151    1653
R3    n = 15    m1 = 7, m2 = 5    149     2000    168     2115    177    1937
R4    n = 30    m1 = 3, m2 = 5    410     2846    321     2359    325    2218
R5    n = 30    m1 = 5, m2 = 5    254     3117    271     3215    247    2838
R6    n = 30    m1 = 7, m2 = 5    239     3547    243     3895    237    3515
R7    n = 50    m1 = 3, m2 = 5    704     4503    547     4527    511    4059
R8    n = 50    m1 = 5, m2 = 5    445     7413    466     7403    440    7327
R9    n = 50    m1 = 7, m2 = 5    456     10,187  475     9806    455    9238
R10   n = 100   m1 = 3, m2 = 5    1356    9586    1132    9457    1098   8483
R11   n = 100   m1 = 5, m2 = 5    983     15,037  973     14,945  936    14,899
R12   n = 100   m1 = 7, m2 = 5    913     20,223  945     20,472  872    19,876

Note: The bolded values in the table represent the best results.
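One plausible way to aggregate Table 9 into average improvements is a simple per-instance relative gain averaged over the twelve instances, as in the sketch below. The paper's exact aggregation formula is not restated here, so the printed figures are indicative only.

```python
# Indicative only: per-instance relative gain of AQL, averaged over the twelve
# Table 9 instances; the paper's exact aggregation may differ.
nsga2  = [(221, 1330), (165, 2070), (149, 2000), (410, 2846), (254, 3117),
          (239, 3547), (704, 4503), (445, 7413), (456, 10187), (1356, 9586),
          (983, 15037), (913, 20223)]
qlearn = [(161, 1221), (151, 1661), (168, 2115), (321, 2359), (271, 3215),
          (243, 3895), (547, 4527), (466, 7403), (475, 9806), (1132, 9457),
          (973, 14945), (945, 20472)]
aql    = [(159, 954), (151, 1653), (177, 1937), (325, 2218), (247, 2838),
          (237, 3515), (511, 4059), (440, 7327), (455, 9238), (1098, 8483),
          (936, 14899), (872, 19876)]

def mean_gain(baseline, ours, col):
    """Average relative improvement (%) of `ours` over `baseline` in column col."""
    gains = [(b[col] - a[col]) / b[col] for b, a in zip(baseline, ours)]
    return sum(gains) / len(gains) * 100

for name, base in [("Q-learning", qlearn), ("NSGA-II", nsga2)]:
    print(f"AQL vs {name}: Cmax {mean_gain(base, aql, 0):.2f}%, "
          f"TEC {mean_gain(base, aql, 1):.2f}%")
```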