1. Introduction
The flow shop scheduling problem (FSSP) plays an important role in manufacturing systems. Optimizing multiple criteria of the FSSP can help reduce manufacturing costs and improve the manufacturing efficiency of enterprises. In recent years, many variants of the FSSP have emerged, and many methods have been proposed to optimize its criteria. The permutation flow shop scheduling problem (PFSP) is a classical form of the FSSP, first introduced and formulated by Johnson [1]. The problem with three or more machines is known to be NP-hard [2]. The goal is to schedule operations on the machines to optimize one or more performance criteria, such as minimizing the makespan, mean tardiness, total late work, or total flow time of all jobs. A rational scheduling algorithm can not only improve the efficiency and performance of the system but also reduce machine costs. Over the past decades, various exact and heuristic algorithms for solving scheduling problems have been suggested [3]. Researchers have generalized the PFSP into multiple variants that simulate real production scenarios, such as the no-wait, blocking, no-idle, and energy-efficient flow shops.
To solve problems of different scales, many effective exact methods, heuristics, and meta-heuristics have been proposed to optimize various criteria of the PFSP. Exact methods based on enumeration with an integer programming formulation are usually employed to find the optimal solution. For large-scale problems, however, the solution space grows so quickly that exact algorithms suffer from combinatorial explosion, and their computation time is usually unacceptable. The most commonly applied approaches for large-scale problems are heuristics such as NEH [4] and CDS [5], which are capable of real-time decision making. However, handcrafted heuristics consider only limited information, which leads to unstable performance. Meta-heuristics such as the genetic algorithm (GA) [6], tabu search (TS) [7], and particle swarm optimization (PSO) [8,9] are a class of problem-independent algorithmic frameworks. Their performance depends on adequate parameter tuning and good initial solutions, and they converge slowly on problems with high computational complexity. With the development of deep learning, reinforcement learning, and high-performance search algorithms, methods applying these techniques have outperformed traditional algorithms on some complex combinatorial optimization problems.
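As a concrete illustration of the constructive heuristics mentioned above, the classic NEH procedure can be sketched in a few lines of Python. This is a minimal makespan-oriented sketch under an assumed instance format (`p[j][m]` = processing time of job `j` on machine `m`), not the exact implementation used in the cited works:

```python
def makespan(seq, p):
    """Completion time of the last job on the last machine for a
    permutation seq, with p[j][m] the time of job j on machine m."""
    m = len(p[0])
    c = [0] * m  # rolling completion times per machine
    for j in seq:
        c[0] += p[j][0]
        for k in range(1, m):
            c[k] = max(c[k], c[k - 1]) + p[j][k]
    return c[-1]

def neh(p):
    """NEH: sort jobs by decreasing total processing time, then
    insert each job at the position that minimizes the makespan
    of the partial sequence built so far."""
    jobs = sorted(range(len(p)), key=lambda j: -sum(p[j]))
    seq = []
    for j in jobs:
        seq = min(
            (seq[:i] + [j] + seq[i:] for i in range(len(seq) + 1)),
            key=lambda s: makespan(s, p),
        )
    return seq
```

The same insertion idea reappears later in the construction phase of the iterated greedy method.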
Late work is important in production planning from the perspective of the customer as well as the manager. Customers are concerned about the completion of their orders, because tasks not completed by their due dates need additional processing. Managers are also interested in minimizing late work, as delays may cause financial losses. The late work criterion was first introduced in [10] and is defined as the number of tardy job units, commonly denoted by Y. Blazewicz et al. described the difference between late work and other performance criteria such as makespan, tardiness, and lateness, and proved that problems with a late work objective are at least as hard as problems with a maximum lateness criterion [11]. Since then, many exact and heuristic algorithms have been proposed for single machine [12], parallel machine [13], and dedicated machine [14] problems with late work objectives. Minimizing late work has applications in many fields, such as chip manufacturing [15], computer integrated manufacturing (CIM) [14], supply chain management [16], and software development processes [17]. The flow shop problem with the objective of minimizing late work was first proposed in [18] and solved using a genetic algorithm. Gerstl et al. studied the properties of the problem and extended it to the proportionate shop setting [19], but no high-performance optimization methods for late work have been proposed in recent years.
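To make the criterion concrete, the total late work of a permutation schedule can be computed as below. This is a minimal sketch under one common convention, in which a job's late work is the part of its last-machine processing performed after its due date; the exact definition used in a given paper may differ:

```python
def total_late_work(seq, p, due):
    """Total late work of a permutation flow shop schedule.
    p[j][m] = processing time of job j on machine m,
    due[j]  = due date of job j.
    Assumption: a job's late work is its last-machine processing
    done after the due date, capped at that processing time."""
    m = len(p[0])
    c = [0] * m
    late = 0
    for j in seq:
        c[0] += p[j][0]
        for k in range(1, m):
            c[k] = max(c[k], c[k - 1]) + p[j][k]
        tardiness = max(0, c[-1] - due[j])
        late += min(p[j][-1], tardiness)
    return late
```

A job finished on time contributes nothing; a job finished entirely after its due date contributes its full last-machine processing time.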
The FSSP is a branch of combinatorial optimization, and many studies applying reinforcement learning (RL) to combinatorial optimization problems have appeared in recent years. Many such problems can be transformed into multi-stage decision-making problems, in which a sequence of decisions must be made to maximize or minimize the objective function. Several researchers have therefore proposed RL-based agents for these problems [20,21,22], and these methods can surpass existing algorithms. The deep reinforcement learning (DRL) methods for these problems fall broadly into two categories: those that construct a solution in an end-to-end way, and those that improve existing feasible solutions.
Among construction-based methods, many studies [20,21,23] are based on the pointer network (PtrNet) [24]. The PtrNet handles variable-size output dictionaries with a sequence-to-sequence model. In [20], a PtrNet was trained with the Actor-Critic algorithm to obtain a distribution over all nodes and solve problems such as the knapsack problem. In [25], a simplified version of the PtrNet was presented that can handle routing problems in both static and dynamic environments: at each time step, the embeddings of the static elements are fed to the RNN decoder, the RNN output and the embeddings of the dynamic elements are combined by an attention mechanism, and the decision is drawn from the resulting distribution over available destinations.
Different from construction heuristics, some studies [26,27,28] focus on improving existing solutions. Chen et al. designed a model called NeuRewriter and applied it to several domains [26]: a region is selected by a region-selection policy, a rewrite rule is chosen by a region-rewrite policy, the rewritten local solution replaces the corresponding part of the original solution to obtain an improved one, and the process repeats until convergence. Another work [28] used a single policy network to select a solution within a neighborhood structured by pairwise local operators, surpassing [26] in both solution quality and generalization on routing problems.
Most of these approaches use graph-independent sequence-to-sequence mappings and do not fully exploit the graph structure of graph-based problems. To make full use of this structure, graph embeddings and graph neural networks (GNNs) have been introduced for graph-based combinatorial optimization [29,30]. They can take into account nodes, edges, and their accompanying labels, attributes, and other information, enabling better use of the network structure for modeling and reasoning.
The traditional methods mentioned above struggle to achieve a trade-off between computation time and solution quality on the PFSP. In shop scheduling, and in combinatorial optimization in general, the objective is typically to minimize some criterion such as total cost, total completion time, distance traveled, or late work; the smaller its value, the better the solution. Construction-based reinforcement learning methods can obtain good solutions in a short time, but their quality generally does not exceed that of meta-heuristics, while improvement-based methods require handcrafted features and are difficult to train. This article proposes an end-to-end reinforcement learning method combined with an improved iterated greedy method to minimize the late work of the PFSP. The contribution of this paper includes the following three aspects.
- (1)
The proposed approach generates high-quality solutions using an end-to-end architecture based on reinforcement learning. The models can be trained without expert knowledge and labeled data, and the trained models can automatically extract features from the problem.
- (2)
The PFSP is innovatively regarded as a complete graph. Two multi-layer graph isomorphism networks (GINs) are used to encode the constraint features and processing time features of the PFSP. The GINs efficiently aggregate each node's own features with those of its neighbors to obtain a contextual representation of each node.
- (3)
An improved iterated greedy (IG) method is proposed. The RL model obtains high-quality initial solutions in a short time, and the IG method then improves them. Experimental results show that the RL + IG method surpasses many excellent heuristic and meta-heuristic algorithms.
The rest of this article is organized as follows.
Section 2 first describes the PFSP of minimizing the late work objective and models it as a sequential decision process, then describes the deep reinforcement learning architecture used and the training method of the model to generate an initial solution to the problem.
Section 3 proposes a hybrid iterative greedy algorithm to further improve the generated initial solution.
Section 4 illustrates the experimental setup of this article and shows the results of comparing the proposed algorithm with other methods.
Section 5 concludes the article and presents several directions for future work.
3. A Hybrid Iterated Greedy Method to Improve the Initial Solutions
The iterated greedy (IG) method, first proposed in [41], is among the most effective meta-heuristics for minimizing the makespan of the flow shop problem. The algorithm first generates an initial solution Π with a heuristic (such as NEH) and improves it with a local search based on the insertion neighborhood. Then, d jobs are removed from the sequence (destruction phase) and reinserted one after another at the positions that minimize the objective value after insertion (construction phase), yielding a new sequence Π′. Finally, Π′ is improved again by the insertion-neighborhood local search, and a probability rule similar to simulated annealing decides whether to accept the result as the incumbent before the next iteration. A well-known variant of the IG algorithm [42] applies the local search to both complete and partial solutions to speed up the search and produces its initial solution with the NEH algorithm plus local search. The IG framework is shown in Algorithm 2.
Algorithm 2 The IG framework with local search

input: a PFSP problem instance P, the number of removed jobs d, simulated annealing parameter T.
output: the best solution found Π*.
Π ← NEH(P); //replaced by the DRL model in this work
Π ← LocalSearch(Π);
Π* ← Π;
while termination criteria not met do
    Randomly remove d jobs from Π (destruction);
    (Let ΠR be the remaining sequence and ΠD be the extracted jobs);
    ΠR ← LocalSearch(ΠR);
    Π′ ← Construction(ΠR, ΠD);
    Π′ ← LocalSearch(Π′);
    Π ← AcceptanceCriterion(Π, Π′, T);
    if f(Π) < f(Π*) then Π* ← Π;
return Π*
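The destruction/construction loop of Algorithm 2 can be sketched as runnable Python. The helper names are hypothetical, the problem-specific local search is omitted for brevity, and `f` stands for the objective function to be minimized:

```python
import math
import random

def destruction(seq, d):
    """Randomly remove d jobs; return the remaining sequence
    and the removed jobs."""
    removed = random.sample(seq, d)
    remaining = [j for j in seq if j not in removed]
    return remaining, removed

def construction(remaining, removed, f):
    """Greedily reinsert each removed job at the position that
    minimizes the objective f."""
    seq = list(remaining)
    for j in removed:
        seq = min(
            (seq[:i] + [j] + seq[i:] for i in range(len(seq) + 1)),
            key=f,
        )
    return seq

def iterated_greedy(init, f, d, temperature, iters=100):
    """Skeleton of the IG loop with a simulated-annealing-style
    acceptance rule (local search omitted)."""
    incumbent = list(init)
    best = list(init)
    for _ in range(iters):
        remaining, removed = destruction(incumbent, d)
        candidate = construction(remaining, removed, f)
        if f(candidate) <= f(incumbent):
            incumbent = candidate
        elif random.random() < math.exp(-(f(candidate) - f(incumbent)) / temperature):
            incumbent = candidate  # occasionally accept a worse solution
        if f(incumbent) < f(best):
            best = list(incumbent)
    return best
```

In the proposed method, the DRL model supplies `init` and the adaptive local search described below is applied inside the loop.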
The proposed deep reinforcement learning architecture (DRL for short) can replace the combination of NEH and the insertion-neighborhood local search as a high-performance initial solution generator. For the destruction and construction phases, an adaptive local search strategy is proposed, consisting of an insertion operator (LS1) and a swap operator (LS2). The weights of the two operators are w1 and w2, with corresponding selection probabilities p1 and p2. Both weights are initialized to 1 and both probabilities to 1/2. As the number of iterations grows, the probability of the operator that improves the current solution more is increased. To improve the effectiveness of the local search, a tie-breaking mechanism is added: if several positions produced by the insertion or swap operator yield the same late work value, the one whose resulting schedule has the smallest idle time is taken as the final result. The iteration limit of the local search is set to half the number of jobs to improve search efficiency. The late work objective is very sensitive to the permutation of the jobs; especially when the problem has a release date constraint, small changes can dramatically worsen the late work value. Therefore, the inverse operator is not used in the improved IG algorithm, as it would be destructive to the current solution. The weights w1, w2 and probabilities p1, p2 of the next iteration are updated as shown in Equations (18)–(20), where f(Π′) is the fitness after the local search update and f(Π*) is the fitness of the best solution currently found (the solution before the local search). In the same iteration, only one of the two operator reward indicators is 1 and the other is 0, and λ is the learning factor. The pseudocode of the local search framework is shown in Algorithm 3, and the pseudocode of the operators it uses is shown in Algorithm 4.
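One possible form of this adaptive update can be sketched as follows. The exponential-smoothing shape, the reward value, and the learning factor `lam` are assumptions standing in for Equations (18)–(20), which are not reproduced here:

```python
def update_operator_probabilities(weights, improved_index, reward, lam=0.5):
    """Adaptive operator selection sketch: the operator that just
    improved the solution receives the reward (indicator 1), the
    other receives 0, and selection probabilities are the
    renormalized weights. lam is the learning factor."""
    new_w = []
    for i, w in enumerate(weights):
        r = reward if i == improved_index else 0.0
        new_w.append(lam * w + (1.0 - lam) * r)
    total = sum(new_w)
    probs = [w / total for w in new_w]
    return new_w, probs
```

Starting from equal weights, an operator that repeatedly improves the solution gradually accumulates a higher selection probability, which matches the behavior described above.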
Algorithm 3 The local search framework

take a permutation Π′ or ΠR as input, set l = 0;
while l is less than the iteration limit do
    generate a random number r ∈ [0, 1);
    if r < p1 then obtain a candidate by the insertion operator LS1;
    else obtain a candidate by the swap operator LS2;
    if the candidate improves the current solution then accept it and set l = 0;
    else l = l + 1;
    update the weights and probabilities of each operator if necessary;
return the improved permutation
Algorithm 4 The process of the two operators

randomly select a position a of the input permutation, let πa be the job at that position, set b = 1, and choose insert or swap;
while b does not exceed the number of jobs do
    if b ≠ a then
        if insert was chosen then insert job πa into the b-th position of the permutation (LS1);
        if swap was chosen then swap the job πa in position a and the job πb in position b (LS2);
        a candidate solution cand is obtained;
        if cand improves the current solution then accept cand;
        else if cand has the same late work value then
            if the tie-breaking condition (smallest idle time) is met then accept cand;
    b = b + 1;
return the best permutation found
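The two operators of Algorithm 4 can be sketched as below. This is a simplified version without the idle-time tie-breaking, and `f` is assumed to be the late work objective evaluated on a full permutation:

```python
import random

def best_insertion(seq, f):
    """LS1 sketch: remove the job at a random position and
    reinsert it at the position with the lowest objective value."""
    a = random.randrange(len(seq))
    job = seq[a]
    rest = seq[:a] + seq[a + 1:]
    return min(
        (rest[:i] + [job] + rest[i:] for i in range(len(rest) + 1)),
        key=f,
    )

def best_swap(seq, f):
    """LS2 sketch: swap a random position with every other
    position and keep the best resulting permutation."""
    a = random.randrange(len(seq))
    best = list(seq)
    for b in range(len(seq)):
        if b == a:
            continue
        cand = list(seq)
        cand[a], cand[b] = cand[b], cand[a]
        if f(cand) < f(best):
            best = cand
    return best
```

Both operators can return the input permutation unchanged, so neither can worsen the current solution, which is what makes the acceptance logic in Algorithm 3 straightforward.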
This method uses the same acceptance criterion as the IG algorithm in [41], which applies the idea of simulated annealing to decide whether to accept the candidate solution cand obtained in each iteration. If cand is better than or equal to the current best solution Π*, then Π* is directly replaced with cand. If cand is worse, it is accepted with probability exp(−(f(cand) − f(Π*))/Temperature), where Temperature is calculated as in Equation (21) and T is a hyperparameter that can be adjusted.