1. Introduction
In smart cities, intelligent transportation systems (ITSs) are highly anticipated to improve transportation efficiency and promote sustainable transportation development [1]. For port cities, the rapid growth in container logistics has brought great challenges to the throughput and operational efficiency of container terminals. With the advancement of the Internet of Things (IoT), Automated Container Terminals (ACTs) have emerged as a popular way to address this issue. ACTs have not only become a hot research topic in academia [2] but have also been implemented in several ports, such as Qingdao Port and Yangshan Port in Shanghai. As the starting point of loading operations and the end point of unloading operations, the storage yard is critical to the handling efficiency of ACTs [3]. Automated Stacking Cranes (ASCs) are the most important equipment in the yard. Therefore, effective ASC scheduling is of great importance for improving terminal operational efficiency. Meanwhile, energy consumption can be reduced by optimizing the run time of ASCs.
An automated storage yard often consists of several blocks. Each block contains a number of bays for container storage and is equipped with two ASCs (called twin ASCs) to move containers. In this paper, we consider the scheduling of twin ASCs in one block. Most existing works treat twin ASCs scheduling as a static problem. In reality, however, containers often arrive dynamically due to the uncertain arrival of vessels and the scheduling of upstream operations before storage in the yard, such as the unloading sequence of quay cranes (QCs) [4] and the dispatching of Automated Guided Vehicles (AGVs) [5]. Therefore, we consider a more realistic twin ASCs scheduling problem with dynamic container arrival, which brings the challenge of uncertainty. In such cases, the limited information about containers is insufficient for Mathematical Programming (MP) to construct the full set of constraints, and the incomplete task information cannot support meta-heuristics in exploring the entire operation process. In addition, given the NP-hardness of the problem, none of the previous methods can effectively handle a dynamic environment, which requires a real-time response [6]. In practice, such dynamic scheduling problems are often solved by rule-based heuristics. However, existing scheduling rules are mostly designed based on human intuition; hence, the quality of their solutions is severely limited by their greedy nature and the lack of comprehensive observation of system information.
Essentially, the dynamic twin ASCs scheduling problem can be considered a variant of the Dynamic Job-shop Scheduling Problem (DJSP): the container handling tasks and the ASCs can be regarded as the jobs and machines, respectively, and the dynamic attribute is the dynamic arrival of import containers. Recently, motivated by impressive advances in Reinforcement Learning (RL) [7,8] and Deep Neural Networks (DNNs) [9], Deep Reinforcement Learning (DRL) has been employed to solve NP-hard combinatorial optimization problems [10] such as the JSP and the Traveling Salesman Problem (TSP) [11]. In DRL, the complex computation of traditional methods is replaced by an elaborate DNN, which endows the agent with high-quality solutions and millisecond inference speed. Therefore, DRL has great potential for tackling dynamic scheduling problems. However, the dynamic twin ASCs scheduling problem is distinguished from the standard JSP by unique characteristics that cannot be handled by existing DRL-based methods. Firstly, before handling a container, the ASC must move to the position of that container from the end point of its previous handling. The time for this movement (known as setup time) depends on the handling sequence; hence, location-related information is critical to minimizing the run time of ASCs. Secondly, the dynamic arrival of import containers can cause severe congestion in the ACT, so it is important to capture congestion-related information by analyzing the sequential information of dynamic containers. As a result, the agent needs to learn location-related information and congestion-related information, which have different semantics and require different neural structures, simultaneously. Such multi-aspect learning is challenging for existing DRL-based scheduling methods.
In this paper, we propose an end-to-end DRL method to automatically learn effective heuristics for solving the dynamic twin ASCs scheduling problem. Compared with rule-based heuristics, the learned policies achieve noticeably better scheduling performance at similar time cost. Specifically, we first propose a Markov Decision Process (MDP) model to simulate the dynamic handling process of a block with twin ASCs. Based on the MDP model, a novel neural structure is proposed to map the observation to a scheduling policy. In particular, we propose the dynamic masked self-attention (DMA) mechanism to grasp location-related information between containers while suppressing the information interference caused by ineligible containers. Further, we propose local information complementary attention (LICA) to extract the congestion-related information of import containers and fuse it with the location-related information. Finally, we design a size-agnostic policy network to calculate the distribution over action spaces of variable size. Benefiting from this size-agnostic property, the trained agent can generalize to problems of varying scales (i.e., numbers of containers). We train the proposed network using Proximal Policy Optimization (PPO) [12]. Extensive experiments show that the learned policies perform remarkably well on problems of various scales. Moreover, the learned policies exhibit strong generalization on instances of different scales or distributions.
The main contributions of this study are listed as follows:
The DRL method is introduced to the dynamic twin ASCs scheduling problem for the first time. The millisecond decision time, outstanding scheduling performance and robust generalization capability are extremely promising for dynamic scheduling in real scenarios.
We propose a novel MDP model for the dynamic scheduling problem, which comprehensively simulates the dynamic handling process in the container block.
We propose a novel attention mechanism, DMA, which can automatically ignore information interference and grasp pairwise location relationships between containers.
We propose a novel attention-based framework, LICA, which complements the container feature embeddings extracted by DMA with congestion-related information.
The remainder of this paper proceeds as follows: Section 2 summarizes relevant works. The research problem is formally described in Section 3. Section 4 describes the proposed neural structure and DRL method in detail. The comparative experimental results are presented and discussed in Section 5. Lastly, Section 6 concludes the paper.
3. Problem Description
In this section, we elaborate on the dynamic twin ASCs scheduling problem with the dynamic arrival of import containers.
In an ACT, the automated yard consists of a number of blocks perpendicular to the quay. We consider the scheduling of twin ASCs for one block, the layout of which is shown in Figure 1. The block consists of a number of bays numbered $0, 1, \dots, B+1$ in order from quay to land. Bay 0 and bay $B+1$ are called the seaside IO and landside IO, respectively, and can be regarded as buffers with a certain capacity. The seaside IO is responsible for container shifts between the block and the AGVs, and the landside IO is used to shift containers between the block and external trucks. Bays $1, \dots, B$ constitute the space for storing and stacking containers and are called the storage area. There are two ASCs, called twin ASCs, for container handling in the block. In a container handling process, the travel distance of ASCs in the row direction is generally much smaller than that in the bay direction, and motion in both directions can be performed simultaneously. Without loss of generality, we only consider the bay-direction movement of the ASCs. Twin ASCs are more efficient with a balanced workload on the seaside and landside [34]. Consequently, we set the bay at the average of all container handling midpoints as the handshake area to promote workload balance. The seaside ASC transports containers between the seaside IO and the handshake area; the work range of the landside ASC lies between the handshake area and the landside IO. Since the two ASCs share a common track, they cannot overlap or cross each other. For ease of description, we only consider the interference situation with non-crossing and non-overlapping constraints. Note that safety-distance constraints can be incorporated into the presented methods [18,19].
In an actual ACT, to facilitate import/export management, container transportation between blocks and vessels and between blocks and external trucks cannot be performed simultaneously [35]. Here, we only consider container movement between blocks and vessels. When a vessel enters the terminal, a group of import containers needs to be transported by AGVs from the quay to the seaside IO for storage. Owing to its limited capacity, if the seaside IO is already full, an AGV with an import container must wait for other containers in the seaside IO to be removed before unloading. Simultaneously, a group of export containers needs to be transported from the seaside IO to the quay by empty AGVs. Figure 1 gives an example wherein an import container (green) needs to be moved into the block and an export container (red) needs to be moved to the seaside IO.
The container handling process is performed over a finite time horizon $T$. The set of all containers is denoted as $C = C_I \cup C_E$, where $C_I$ and $C_E$ are the sets of import and export containers, respectively; the import containers in $C_I$, transported by AGVs, arrive dynamically at the seaside IO. For convenience of formulation, the arrival of import containers is modeled as a Poisson process. Without loss of generality, we denote a container $c \in C$ as a tuple $(a_c, o_c, d_c)$, which comprises its arrival time $a_c$, start position $o_c$, and destination position $d_c$. Specifically, for import containers $c \in C_I$, $a_c$ is the arrival time of the AGV carrying container $c$, $o_c = 0$ (the seaside IO), and $d_c \in \{1, \dots, B\}$. Benefiting from the stability of AGV transport, once a container is loaded onto an AGV by the quay crane, its arrival time at the seaside IO can be predicted (i.e., the arrival time $a_c$ is observable in a time window before $a_c$) [4,5]. Export containers $c \in C_E$ are located in the block at the beginning, i.e., $a_c = 0$, $o_c \in \{1, \dots, B\}$, and $d_c = 0$.
As mentioned above, AGVs with import containers may wait at the seaside IO due to congestion. This waiting time is a variable determined by the twin ASCs scheduling results, and it directly affects seaside IO occupancy. As a result, an import container can be handled by an ASC only after it has been unloaded into the seaside IO; we denote its earliest accessible time as $e_c$. Until it is moved out of the seaside IO by an ASC, it occupies a position within the seaside IO. Since export containers do not need to wait, we have $e_c = a_c$ for $c \in C_E$. However, export containers need to be handled to the seaside IO and remain there until an empty AGV transports them away. Additionally, it is assumed that the arrival of empty AGVs also follows a Poisson process. Once an import container is handled into the seaside IO, it will be removed at the earliest feasible time indicated in the arrival time list of empty AGVs.
For a container $c$, if $o_c$ and $d_c$ are on the same side of the handshake area (e.g., the green one in Figure 1), it needs to be handled once by the seaside ASC. Otherwise, it needs to be handled twice by different ASCs: the target position of the first handling is the handshake area, and the target position of the second handling is the container's destination position $d_c$ (e.g., the handling process of the red container in Figure 1). Here, we denote the seaside and landside ASC as $A_s$ and $A_l$, respectively, and model each handling as an operation $i \in O = O_s \cup O_l$, where $O_s$ and $O_l$ are the sets of operations of $A_s$ and $A_l$, respectively. The operation on each container $c$ cannot start before its earliest accessible time $e_c$ and is non-preemptive (i.e., once started, the container must be handled to the target position of the operation).
The processing time of each operation $i$ consists of three parts: (1) the fixed handling time, (2) the picking-up and dropping-down time of the ASC and (3) the waiting time caused by possible ASC interference. The fixed handling time is the time the ASC takes to move from the operation's start position $o_i$ to its target position $d_i$. From the above, the processing time $p_i$ can be formulated as:

$$p_i = \frac{|d_i - o_i|}{v} + 2\,t_{pd}, \quad (1)$$

where $v$ is the velocity of the ASC, and $t_{pd}$ is the time for the ASC to pick up or drop down a container.

Before handling an operation $i$, the ASC must move from its current position (i.e., the target position $d_j$ of the previous operation $j$ on this ASC) to the location of the next container (i.e., the start position $o_i$ of $i$). This movement can be regarded as the setup before the operation, and the corresponding setup time $s_i$ can be formulated as:

$$s_i = \frac{|o_i - d_j|}{v}, \quad (2)$$

where $j$ indexes the previous operation of $i$ on the same ASC. At the beginning, the seaside ASC and the landside ASC are located at the seaside IO and the landside IO, respectively. Clearly, the setup time depends on the scheduled handling sequence.
During the handling of operation $i$, if both ASCs need to pass through the handshake area at the same time, one ASC must wait outside (determined by a priority rule) to avoid ASC interference. The duration of such waiting, denoted as $w_i$, is a variable related to the scheduling decisions.
In this article, we attempt to improve the service quality and economic benefits by minimizing congestion and operation cost. To this end, we adopt the integrated objective in [20]:

$$\min \; W_{AGV} + R_{ASC}, \quad (3)$$

where $W_{AGV}$ and $R_{ASC}$ are the total waiting time of AGVs with import containers and the cumulative run time of the ASCs to complete all operations, respectively. They represent the congestion of AGVs in the block and the operating cost of the ASCs, respectively. Concretely, they are formulated as:

$$W_{AGV} = \sum_{c \in C_I} (u_c - a_c), \qquad R_{ASC} = \sum_{i \in O} (s_i + p_i),$$

where $u_c$ denotes the time at which the AGV carrying container $c$ is unloaded at the seaside IO.
Remark 1. Abstractly, the import containers can be regarded as dynamically arriving jobs, while the export containers are static. One ASC (machine) can only process one job at one time and must move to the position of the next container before handling it (setup). In addition, the ASC must comply with the non-overlap constraint during all handling processes. To sum up, the scheduling problem studied in this article can be viewed as a DJSP problem with a sequence-dependent setup time and machine interference constraints.
4. Methodology
In this section, we introduce our method in detail. Firstly, the MDP model is established, followed by the definition of the raw state features. Then, our neural architecture is introduced in detail, including the two novel designs: DMA and LICA. Finally, the actor–critic-based training algorithm is presented.
4.1. Markov Decision Process
In this part, we formulate an event-driven MDP model to simulate the dynamic twin ASCs scheduling problem.
4.1.1. State
The decisions are made at the initial time or whenever an ASC completes a handling operation. If there is no eligible container to handle at the current decision step, the environment transitions to the next decision step. The ASC to be scheduled at step $t$ is denoted as $k_t$; here, $k_t$ is the ASC that has just completed its handling operation. The state at decision step $t$, denoted as $s_t$, contains information about the containers and the block at $t$, such as the destination bays of import containers, the arrival times of observable import containers, the origin positions of export containers, the eligibility of all containers, the current positions of the ASCs, and the occupancy status of the seaside IO. Note that the eligible containers are those that can be handled by the ASC scheduled at step $t$: specifically, containers that have not yet been handled to their destination and whose current operations lie within the scheduled ASC's work range.
4.1.2. Action
An action at step $t$ is to select an eligible container for $k_t$, i.e., to select a job for the machine. The action space $\mathcal{A}_t$ is the set of eligible containers. As the handling progresses, the eligible actions differ at each step $t$; both manual rules and DRL techniques must eliminate ineligible containers when choosing actions to ensure the validity of the action.
4.1.3. Transition
Once the action is taken, the selected container is transported to its target bay, and the state transitions to the next state $s_{t+1}$. In this process, the scheduled ASC moves to the current location of the selected container and transports the container to its target bay (i.e., the destination bay or the handshake area). Considering possible ASC interference during the state transition, we apply the First-In–First-Out (FIFO) rule to prioritize the ASCs so as to satisfy the non-overlap constraints. This resolves possible interference at negligible cost.
4.1.4. Reward
To minimize the objective (3), the reward function $r_t$ is designed as the negative increment of the AGV waiting time and the ASC work time after taking action $a_t$ at step $t$. The mathematical expression is as follows:

$$r_t = -\left[(W_{t+1} - W_t) + (R_{t+1} - R_t)\right],$$

where $W_t$ is the total waiting time of AGVs with import containers up to step $t$, and $R_t$ is the accumulated run time of the twin ASCs up to step $t$.
4.1.5. Policy
For the state $s_t$, the stochastic policy $\pi(a_t \mid s_t)$ is a probability distribution over the action space $\mathcal{A}_t$. Due to the dynamic nature of the investigated problem, we use the greedy strategy at execution time and select the action with the highest probability. For agent training, however, we sample actions according to $\pi(a_t \mid s_t)$ for the purpose of exploration.
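In code, the two action-selection modes amount to the following (a trivial sketch; the probability vector would come from the policy network):

```python
import numpy as np

def select_action(probs, training, rng=np.random.default_rng(0)):
    # Sample during training (exploration); act greedily at execution time.
    if training:
        return int(rng.choice(len(probs), p=probs))
    return int(np.argmax(probs))
```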
4.2. Raw Features
To comprehensively represent the state $s_t$, we design a set of raw features describing the states of all containers and resources (ASCs) in the block. For each container $c$, we record the following information as raw features:
The current position of c, which is an integer specifying the index of its current bay;
The destination of c, which is an integer specifying the index of its destination bay;
The distance between container c and the current position of the ASC scheduled at t;
The eligibility of c, which is a binary value specifying if c is accessible for the scheduled ASC at t (0) or not (1);
The number of import containers in the seaside IO, which is an integer representing the occupancy of the seaside IO by imported containers;
The number of export containers in the seaside IO, which is an integer representing the occupancy of the seaside IO by exported containers;
The difference between the arrival time of c and the current time: for special cases, it is set to 0 if c has already arrived, or a large number if its arrival time is out of the observation time window.
Here, features (1)–(4) are location-related features, and features (5)–(7) are congestion-related features. Note that feature (7) is exclusive to import containers.
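Assembled per container, the feature vector might look as follows (a sketch only; the dict fields, the window length and the large-number sentinel are assumptions, not the paper's implementation):

```python
BIG = 1e6  # sentinel for arrivals outside the observation time window

def raw_features(c, asc_bay, n_import_io, n_export_io, now, window=60):
    """Build the seven raw features for one container c (given as a dict)."""
    if c["arrived"]:
        arrival_gap = 0.0                       # already at the seaside IO
    elif c["arrival_time"] - now <= window:
        arrival_gap = c["arrival_time"] - now   # observable future arrival
    else:
        arrival_gap = BIG                       # outside the observation window
    return [
        c["current_bay"],                 # (1) current position
        c["dest_bay"],                    # (2) destination
        abs(c["current_bay"] - asc_bay),  # (3) distance to the scheduled ASC
        0 if c["eligible"] else 1,        # (4) eligibility flag
        n_import_io,                      # (5) import containers in seaside IO
        n_export_io,                      # (6) export containers in seaside IO
        arrival_gap,                      # (7) time until arrival (imports only)
    ]
```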
4.3. Feature Extraction Network
While the above features are informative, efficient feature extraction from complex redundant raw features is extremely challenging for existing neural network structures. Firstly, the location relationship between the containers and ASCs is pair-wise information, which is subject to the information interference of ineligible containers. Secondly, the effective fusion of the extracted congestion-related embeddings and location-related embeddings is challenging for existing methods. To tackle the above problems, two novel neural architectures, DMA and LICA, are elaborated as follows.
4.3.1. DMA
According to the problem description in
Section 3, the characteristics of the ASC container handling process can be summarized as follows: Firstly, ASCs handle the containers in a one-dimensional space (i.e., travel between bays). Secondly, before handling the subsequent container, the ASCs must move from the location of their previous operation to the location of the subsequent container (i.e., setup). It is obvious that the pairwise position relationship between containers and ASCs is essential to optimize the run time of ASCs. Driven by this intuition, we attempt to introduce a self-attention mechanism [
36] into the feature extraction network to grasp the pairwise location relationships. A self-attention mechanism has been employed to solve some combinatorial optimization problems, such as routing problems [
37,
38,
39,
40]. However, most of these techniques follow the encoder–decoder structure [
41], which is not applicable for dealing with complex features that change constantly [
42]. Therefore, we adopt a step-wise decision framework: i.e., the network extracts embeddings and makes decisions at each step instead of using a decoder. Additionally, the dynamic nature of the research problem makes the extraction of embeddings subject to the interference of redundant information (i.e., the features of ineligible containers). As will be shown in the experiments, directly applying the attention mechanism results in poor performance. To tackle this issue, we propose a novel attention mechanism, DMA, to eliminate the interference of redundant information. The details are introduced in combination with
Figure 2 as follows.
In order to unify the model input dimensions, the location-related features are projected to a fixed dimension $d$ to form the input feature matrix $X$. Subsequently, the Query ($Q$), Key ($K$) and Value ($V$) matrices are obtained by linear projections:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V,$$

where $W^Q$, $W^K$ and $W^V$ are trainable parameters, and the key dimension $d_k = d$ in our method.
Based on matrices $Q$ and $K$, the compatibility matrix $M \in \mathbb{R}^{n \times n}$ can be calculated by

$$M = \frac{QK^{\top}}{\sqrt{d_k}},$$

where $n$ is the number of containers, and $\sqrt{d_k}$ is the scaling factor used to regulate the attention weights after the softmax.
In DMA, the compatibility matrix $M$ is masked by a dynamic mask to eliminate attention to the information of ineligible containers. Specifically, for each column of the compatibility matrix, the elements are identified according to the eligibility of containers in the current state $s_t$ and are masked out from the softmax by overwriting their scores as follows:

$$m_{ij} = \begin{cases} m_{ij}, & \text{if container } j \text{ is eligible at } s_t, \\ -\infty, & \text{otherwise}, \end{cases}$$

where $m_{ij}$ is the element in row $i$ and column $j$ of the compatibility matrix $M$.
Then, the attention weights are calculated by a row-wise softmax, and the output is obtained as the weighted sum of the value vectors in $V$. The above process can be written as:

$$\mathrm{DMA}(X) = \mathrm{softmax}(M)\,V.$$
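The masked attention step can be sketched in a few lines of NumPy (a single head; the dimensions and the large finite stand-in for $-\infty$ are illustrative choices, not the paper's implementation):

```python
import numpy as np

def dma(X, Wq, Wk, Wv, eligible):
    """Single-head dynamic masked self-attention.
    X: (n, d) container features; eligible: (n,) boolean flags per container.
    Columns of ineligible containers are masked out before the row-wise softmax."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Wk.shape[1]
    M = Q @ K.T / np.sqrt(d_k)             # compatibility matrix
    M[:, ~eligible] = -1e9                 # dynamic mask: ignore ineligible columns
    A = np.exp(M - M.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)      # row-wise softmax
    return A @ V                           # weighted sum of value vectors
```

Because the masked columns receive zero attention weight, perturbing an ineligible container's features leaves every other container's embedding unchanged, which is exactly the interference-elimination property DMA targets.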
Inspired by multihead self-attention (MHA) [36], we propose multihead DMA to capture more features from different high-dimensional spaces. As shown in Figure 2, $H$ DMA modules, which do not share parameters, are run in parallel. We then fuse all of their outputs by concatenation and project the result back to the $d$-dimensional space with a fully connected layer. Finally, the output embeddings of dimension $d$ are obtained by a skip connection and layer normalization. The above process can be formulated as follows:

$$E = \mathrm{LN}\!\left(X + \mathrm{FC}\!\left([h_1; h_2; \dots; h_H]\right)\right),$$

where $h_1, \dots, h_H$ are the outputs of the DMA modules, $[\,\cdot\,;\,\cdot\,]$ is the horizontal concatenation operator, $\mathrm{FC}$ is a fully connected layer, and $\mathrm{LN}$ is the layer normalization operator. In our method, we stack $L$ multihead DMA layers to obtain the location-related embeddings $E^{loc}$ so as to balance performance and computational efficiency.
Remark 2. Compared with the standard attention mechanism, the DMA can automatically eliminate the interference of redundant information (i.e., the features of ineligible containers). Benefiting from the multihead DMA, the feature extraction network can effectively learn to grasp the pairwise location-related relationships between containers, which is significant to minimize the run time of ASCs. Because the interference of redundant information is circumvented, the agent can converge to a better result in the training process. Furthermore, the multihead DMA has a size-invariant property with regard to the number of containers, which enables the network to work on problems of different scales using the same parameters.
4.3.2. LICA
On the other hand, congestion-related information is critical to optimizing the total waiting time of AGVs with import containers. Due to the different characteristics of location-related and congestion-related information, using a single neural architecture to extract embeddings of both kinds cannot achieve satisfactory performance. Considering the sequential nature of congestion-related information, Long Short-Term Memory (LSTM) [43] is employed to extract embeddings from the congestion-related features. Feature (7) is exclusive to import containers, and the congestion-related information mainly affects the priority of import containers. Therefore, the congestion-related embeddings $E^{con}$ are only computed for import containers, where $n_I$ is the number of import containers. To effectively fuse the two types of information with different semantics, we design a novel neural framework, LICA, whose inputs are the location-related embeddings captured by DMA and the congestion-related embeddings captured by LSTM. The details are as follows.
As illustrated in Figure 3, the location-related embeddings $E^{loc}$ obtained by DMA (the blue block in Figure 3) and the congestion-related embeddings $E^{con}$ calculated by LSTM (the red block in Figure 3) have different dimensions. In LICA, the location-related embeddings are split into two parts: the location-related embeddings of import containers $E^{loc}_I$ and the location-related embeddings of export containers $E^{loc}_E$, where $n_E$ is the number of export containers. Then, the location-related embeddings of import containers and the congestion-related embeddings are fused by a convolution fusion module. The first step is to stack them into a tensor $T$. Then, $T$ is transformed into a new matrix by multiple point-wise convolution layers to form the fusion embedding of the import containers:

$$E^{fus}_I = \mathrm{Convs}(T),$$

where $E^{fus}_I$ has the same dimensions as $E^{loc}_I$, and $\mathrm{Convs}$ is an operator representing multiple point-wise convolution layers with two hidden layers of 16 and 4 channels, respectively.
Following the convolution fusion, the new embeddings of the import containers $E^{fus}_I$ include both the location-related and the congestion-related information. To fuse the embeddings of import and export containers effectively, a secondary fusion module based on MHA is proposed. Specifically, we concatenate $E^{fus}_I$ and $E^{loc}_E$, and a further fusion is performed by a standard MHA:

$$E = \mathrm{MHA}\!\left([E^{fus}_I; E^{loc}_E]\right),$$

where $E$ is the output of LICA as well as the embeddings captured by the feature extraction network.
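A NumPy sketch of the convolution-fusion step follows (illustrative only: the channel counts, the assumption that both embeddings share dimension $d$, and the replacement of the secondary MHA fusion by a plain concatenation are all simplifications):

```python
import numpy as np

def pointwise_conv(T, W):
    # 1x1 convolution over the channel axis: (n, d, c_in) x (c_in, c_out) -> (n, d, c_out)
    return np.tensordot(T, W, axes=([2], [0]))

def lica_fuse(E_loc_imp, E_con, E_loc_exp, conv_weights):
    """Stack import location/congestion embeddings, fuse them with point-wise
    convolutions, then concatenate with the export embeddings."""
    T = np.stack([E_loc_imp, E_con], axis=-1)        # (n_I, d, 2)
    H = T
    for W in conv_weights[:-1]:
        H = np.maximum(pointwise_conv(H, W), 0.0)    # ReLU on hidden conv layers
    H = pointwise_conv(H, conv_weights[-1])          # final layer -> 1 channel
    E_imp = H[..., 0]                                # (n_I, d) fused import embeddings
    return np.concatenate([E_imp, E_loc_exp], axis=0)  # (n_I + n_E, d)
```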
Remark 3. The proposed fusion framework, LICA, is capable of fusing embeddings with different dimensions. Benefiting from LICA’s superior fusion capabilities, the location-related embedding is complemented by the congestion-related information. The embeddings captured by the feature extraction network can effectively represent the location-related relationship and the congestion state of the block. Moreover, the size-insensitive property is retained. Therefore, the proposed feature extraction network provides efficient information processing for decision making.
4.4. Actor–Critic Framework and Training Algorithm
Based on the feature extraction network, we construct an actor–critic framework, which consists of a policy network and a value network, and train it by PPO.
4.4.1. Policy Network
To map the extracted embeddings to the policy $\pi(a_t \mid s_t)$, a size-agnostic policy network is proposed in our method. The main part of this policy network is a multilayer perceptron (MLP) consisting of fully connected layers (FCs). A vector representing the priority of all containers is calculated by the MLP:

$$u = \mathrm{MLP}_p(E),$$

where $u \in \mathbb{R}^{n}$, $E$ is the embedding matrix from the feature extraction network, and $\mathrm{MLP}_p$ is an MLP (activated by the rectified linear unit (ReLU)) with two hidden layers of dimensions 64 and 32, respectively.

To guarantee the validity of the learned policy, we mask out the ineligible containers prior to the softmax. The process of calculating the policy from $u$ can be formulated as:

$$\pi(a_t \mid s_t) = \mathrm{softmax}(u + \mathrm{mask}_t), \quad \text{with } \mathrm{mask}_t(c) = \begin{cases} 0, & c \text{ eligible}, \\ -\infty, & \text{otherwise}. \end{cases}$$
Following the aforementioned process, actions with higher scores that are not masked have a higher probability of being selected. The probabilities for ineligible containers are 0 and, hence, they will never be selected.
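The masking-then-softmax step can be sketched as follows (using $-\infty$ for masked scores, so their probabilities come out exactly 0):

```python
import numpy as np

def masked_policy(scores, eligible):
    """Turn priority scores into action probabilities; ineligible containers get 0."""
    s = np.where(eligible, scores, -np.inf)
    e = np.exp(s - s[eligible].max())   # subtract max eligible score for stability
    return e / e.sum()
```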
4.4.2. Value Network
In the actor–critic framework, a value function is needed to estimate the cumulative reward of the current state $s_t$ and to guide the training of the agent. For this purpose, we propose an MLP-based value network with three hidden layers of dimensions 256, 128 and 64, respectively, and an output dimension of 1. The input is the vector obtained by column-wise summation of the embeddings $E$. The calculation of the value can be formulated as:

$$V(s_t) = \mathrm{MLP}_v\!\left(\textstyle\sum_{j=1}^{n} E_{j,\cdot}\right),$$

where $\mathrm{MLP}_v$ is the MLP in the value network. Its hidden layers are activated by ReLU, but its output layer is not activated.
4.4.3. Training Algorithm
Based on the aforementioned networks, an actor–critic framework is constructed. In our approach, the Proximal Policy Optimization (PPO) algorithm [12] with the Adam optimizer is employed to train the neural network. The algorithm's details are presented in Algorithm 1. To boost learning efficiency, we create $N_{env}$ parallel simulation environments with different training sets to collect rollout data. Each simulation environment collects data every $T_c$ steps (i.e., the collection steps). Using the collected data, the network parameters are updated for $K$ iterations with batch size $B$. After $E$ epochs of training, the model with the best results is selected for scheduling.
Algorithm 1: Training Algorithm
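The heart of each PPO update is the clipped surrogate objective; a minimal NumPy sketch is given below (the log-probabilities and advantages would come from the rollout buffer; this is a generic PPO-clip loss, not the paper's exact implementation):

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, clip=0.2):
    """PPO clipped surrogate loss (to be minimized by the optimizer)."""
    ratio = np.exp(new_logp - old_logp)                  # policy probability ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.mean(np.minimum(unclipped, clipped))      # pessimistic bound, negated
```

When the new and old log-probabilities coincide, the ratio is 1 and the loss reduces to the negative mean advantage; the clipping only engages once the updated policy drifts away from the behavior policy.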
5. Experiment
The experimental details and performance of the proposed method are presented and discussed in this section.
5.1. Experiment Settings
5.1.1. Simulation Environment Settings
The parameters of the simulated scheduling environment are determined according to real-world ACT scenarios and previous research [16,20]. In this article, the block has 41 bays (39 bays for storing containers), which is a medium-scale block in a real ACT. The capacity of the seaside IO is five. For ease of description, the time-related parameters are set to integers according to the proportional relationships in the actual handling process. Concretely, the time for an ASC to move the distance of one bay is defined as one unit of time, and the time for the ASC to pick up or drop down a container is set as $t_{pd}$. Following most research, we consider that the ASCs travel at a constant velocity $v$ (one bay per unit time). Considering the transportation process of the AGVs [44] and the proportional relationship between AGV transportation time and ASC handling time in actual ACT operation [2], the observation time window is set to 60. For import containers outside the observation time window, the difference between the arrival time and the current time is set to a large number.
5.1.2. Instance Generation
To provide training data for DRL and to evaluate the performance, the instances are randomly generated by a generator with four parameters: the number of containers (
N), the percentage of import containers (POI), the mathematical expectation of the interval time between AGVs with import containers (
) and the mathematical expectation of the interval time between empty AGVs (
). Here,
N is used to control the problem scale, and
N containers are separated into
import containers and
export containers according to the POI. For the import containers, their origin bays are the seaside IO (bay 0), and the destination bays are drawn from the uniform distribution
. On the contrary, the destination bays of export containers are the seaside IO, and their origin bays are drawn from the uniform distribution
. In dynamic scheduling problems, the Poisson process is generally used to model dynamic arrival [
6]. Here, the arrival time of AGVs with import containers and the arrival time of empty AGVs are generated by Poisson process with
and
, respectively.
We generate five groups of instances of different scales for training and testing by setting $N$ to 20, 30, 40, 50 and 60, denoted as C20, …, C60, respectively. For training, we set POI = 50%, $\lambda_I = 1/26$ and $\lambda_E = 1/30$. For each scenario, we generate 30,000 instances for training and an additional 1000 instances for testing.
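A generator matching this description could be sketched as follows (field order follows the tuple $(a_c, o_c, d_c)$ from the problem description; treating $\lambda$ as a Poisson rate, i.e., exponential inter-arrival times with mean $1/\lambda$, is an assumption):

```python
import numpy as np

def generate_instance(N=20, poi=0.5, lam_imp=1/26, lam_emp=1/30, bays=39, seed=0):
    """Random instance: import containers arrive by a Poisson process
    (i.i.d. exponential inter-arrival times); export containers are static."""
    rng = np.random.default_rng(seed)
    n_imp = int(N * poi)
    arrivals = np.cumsum(rng.exponential(1 / lam_imp, n_imp))  # mean interval 1/lam
    # imports: seaside IO (bay 0) -> uniform storage bay
    imports = [(float(t), 0, int(rng.integers(1, bays + 1))) for t in arrivals]
    # exports: uniform storage bay -> seaside IO (bay 0), present from time 0
    exports = [(0.0, int(rng.integers(1, bays + 1)), 0) for _ in range(N - n_imp)]
    return imports, exports
```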
5.1.3. Baselines
In this paper, we compare the scheduling performance of our method against a set of rule-based heuristics and two DRL baselines. The heuristics are well-known scheduling rules for dynamic JSP with sequence-dependent setup times [45] and twin-ASC scheduling [13]. Specifically, the rules are Random, Shortest Processing Time (SPT), Longest Processing Time (LPT), Shortest Setup Time (SST) (i.e., the nearest-neighbor rule in [13]) and Prioritize Buffer Clearing (PBC) (i.e., prioritizing the movement of containers out of the seaside IO). To demonstrate the effectiveness of our feature extraction network design, we also compare with two feature extraction structures: one based on the standard attention mechanism (AM) as in [37,42] and one based on the multilayer perceptron (MLP) as in [31,32,33]. Concretely, the AM model is a two-layer MHA with eight heads, and the MLP has two hidden layers with 64 and 128 neurons, respectively.
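For illustration, the two strongest rules can be sketched as follows. These are hypothetical helpers under the assumption that the candidate jobs available to an idle ASC are given as (job_id, origin_bay) pairs:

```python
def sst_rule(asc_bay, candidates):
    """Shortest Setup Time: choose the job whose origin bay is nearest to the
    ASC's current bay (the nearest-neighbor rule).
    candidates: list of (job_id, origin_bay) pairs of available jobs."""
    return min(candidates, key=lambda c: abs(c[1] - asc_bay))[0]

def pbc_rule(candidates):
    """Prioritize Buffer Clearing: prefer import jobs waiting at the seaside
    IO (origin bay 0); otherwise fall back to the first available job."""
    io_jobs = [c for c in candidates if c[1] == 0]
    return (io_jobs or candidates)[0][0]
```

SST reduces empty-travel (setup) time of the ASCs, while PBC keeps the capacity-limited seaside IO clear and thus targets congestion.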
5.1.4. Hyperparameters and Implementation
In our method, we train for a fixed number of epochs with parallel environments. For PPO, we set the clip range to 0.2, together with a fixed batch size, number of update iterations and learning rate. To boost learning efficiency on the large-scale problems (C50 and C60), we adopt a "warm start" for training: we load the model trained on the C40 instance set and continue training it on the corresponding-scale training set for several epochs.
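The PPO clipping mentioned above can be written out as a minimal sketch. This is a generic illustration of the clipped surrogate objective (the function name is hypothetical; only the clip range of 0.2 comes from the text):

```python
def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the probability ratio of new to old policy and A the advantage.
    clip_range=0.2 matches the setting reported in the paper."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipping removes the incentive to move the policy ratio outside [0.8, 1.2], which stabilizes the policy updates during training.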
Our method and the baselines are implemented in Python. The environment and the DRL algorithm are built on OpenAI Gym and Stable-Baselines3. The hardware is a machine with an Intel Core i9-10920X CPU and a single Nvidia GeForce RTX 2080 Ti GPU.
5.2. Training and Ablation Experiments
Here, we perform an ablation study to evaluate the effectiveness of the two key components of our method, namely DMA and LICA. We compare five models with different feature extractors: the MLP model (MLP), the standard self-attention model (AM), the DMA model (DMA), the LICA model with standard self-attention (LICA-AM) and the LICA model with DMA (LICA-DMA). We train these models on the representative medium-scale problem C40 and evaluate them on the testing set. The training curves and testing performance are shown in
Figure 4 and
Table 1, respectively.
As shown in
Figure 4, the MLP model, with its simple neural structure, fails to learn effective policies due to its poor feature extraction capability, so it is excluded from the following comparisons. DMA converges faster than AM and reaches much higher cumulative rewards, showing its ability to automatically eliminate interference and extract the location-related interrelationships between containers and ASCs. The advantage of DMA over AM is also validated in Table 1: DMA is clearly more competent than AM at extracting embeddings under information interference.
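The masking idea behind this interference elimination can be illustrated with a minimal sketch (hypothetical code, not the paper's DMA layer): ineligible containers receive a score of negative infinity before the softmax, so they get exactly zero attention weight and cannot interfere with the embedding:

```python
import math

def masked_attention_weights(scores, eligible):
    """Softmax attention weights with ineligible items masked out.
    scores:   raw attention scores, one per container.
    eligible: booleans; False marks containers that cannot be handled yet."""
    neg_inf = float("-inf")
    masked = [s if ok else neg_inf for s, ok in zip(scores, eligible)]
    m = max(s for s in masked if s != neg_inf)          # for numerical stability
    exps = [math.exp(s - m) if s != neg_inf else 0.0 for s in masked]
    z = sum(exps)
    return [e / z for e in exps]
```

Masked items contribute nothing to the weighted sum, whereas an unmasked softmax would still assign them small but nonzero weight, which is the interference a dynamic mask removes.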
Through the LICA framework, the location-related embeddings are complemented by the congestion-related information extracted by the LSTM. As demonstrated in Figure 4 and Table 1, the framework effectively enhances the DRL model's performance. During training, LICA-AM converges to higher rewards than AM, and LICA-DMA likewise outperforms DMA. For evaluation performance, LICA-AM significantly improves on AM by reducing the integrated objective. Moreover, LICA-DMA grasps high-quality location-related embeddings and fuses them efficiently. As a result, LICA-DMA performs best in both training and evaluation, since it coordinates the two components of the objective more effectively.
5.3. Performance Comparison for Different Training Scales
In this subsection, we evaluate our method on different scales and compare it with the baselines. Concretely, agents are trained on training sets of different scales (C20, …, C60) and tested on same-scale testing sets with the same distribution. The results are summarized in Table 1 and Table 2, and the proportions of the two objective components on the different scales are shown in Figure 5.
As shown in Table 1 and Table 2, PBC is the best rule for minimizing the congestion-related component, while SST is the best rule for minimizing the ASC run time and the integrated objective. On the smallest scale, C20, the DRL method with AM obtains the minimum objective, and our method performs almost identically. On every other scale, LICA-DMA consistently outperforms all baselines, and the margin is noticeable, particularly on large-scale problems. In addition, the policies learned by our approach have the lowest standard deviation on most scales, suggesting that our method is remarkably stable.
As the problem scales up, the complexity increases, the information interference grows, and the proportion of the congestion-related component in the objective rises. The DRL method with AM cannot handle these issues, so its scheduling performance degrades significantly as the problem scale increases; from C40 onward, it cannot even outperform SST. Our method, LICA-DMA, effectively overcomes these issues and maintains excellent scheduling performance across problem scales. As shown in Figure 5, our method effectively minimizes the run time of the ASCs on small-scale problems, and on medium- and large-scale problems it minimizes the integrated objective by weighing the two components against each other. The comparison of the DRL methods in Table 3 confirms the superiority of our method: the scheduling performance of LICA-DMA is significantly better than that of AM on most problem scales. Compared with SST, the proposed DRL method yields improvements of 1.88%, 3.55% and 1.82% on the small and medium scales. On large-scale problems, the agent must place greater emphasis on the influence of congestion on the integrated objective and consequently sacrifices some ASC run time. Regarding traffic congestion, the proposed DRL method outperforms the PBC rule, achieving improvements of 18.77%, 31.39%, 49.18% and 35.7% on the C30–C60 scales. The millisecond-level decision time meets the real-time scheduling requirement, and its growth with problem scale stays within a reasonable range. Although the training time of LICA-DMA is longer than that of AM, it is acceptable in most circumstances because DRL training is performed offline.
5.4. Generalization Performance for Various Scales
In practice, the scale of the scheduling task (i.e., the number of containers) can differ from that used in training, so robust generalization is very important for the practical application of scheduling methods. To verify our method's generalization capability, the agent trained at scale C40 is used to schedule dynamic twin-ASC scheduling problems ranging from C20 to C60. The average scheduling results are exhibited in Table 4, while the improvement of the generalized agent over the best existing methods is depicted in Figure 6 (left).
As the experimental results show, the generalization results of LICA-DMA exceed the best rule on most scales, whereas those of DRL with AM do not. Compared with SST, the improvements of LICA-DMA on the various scales are −4.17%, 4.04%, 11.23%, 13.70% and 14.67%. As the improvement curves of the DRL methods show, the improvement grows with the scale, presumably because the methods' inherent gains grow faster than the losses caused by the scale difference. The generalization of AM, however, has little practical value owing to its poor scheduling performance. In addition, compared with the LICA-DMA models trained on corresponding-scale instances, the gaps of the generalized LICA-DMA model are only 6.32%, 2.68%, 0.05% and 2.90%, respectively. Overall, the results suggest that the policies learned by LICA-DMA generalize robustly across problem scales.
5.5. Generalization Performance for Different Distributions
In practice, the distribution of containers in the scheduling task can also differ from that in training. In the dynamic twin-ASC scheduling problem, the main distributional factor is the percentage of import containers (POI). To validate our method's generalization across POIs, we test the policy trained on the C40 scale with POI = 50% on testing instances with POI ∈ {40%, 45%, 55%, 60%}. Owing to limited space, policies trained on other scales, which behave similarly, are not discussed here. Additionally, we compare only with SST, since the other rule-based heuristics perform comparatively poorly.
Comparing the experimental results in Table 5, the policies learned by our method, LICA-DMA, obtain the lowest objective for all POIs, whereas the policies learned by AM cannot outperform the best rule, SST.
The improvement curves of the generalization models over the best scheduling rule for varying POIs are depicted in Figure 6 (right). AM fails to exceed SST, while the improvements of LICA-DMA for the various POIs are 3.49%, 7.25%, 11.23%, 13.95% and 16.51%, respectively. The AM curve is insensitive to the POI, but given its poor scheduling performance this does not indicate good generalization. As the LICA-DMA curve shows, the improvement grows as the POI increases, because our approach handles blocking more effectively than SST: as the POI rises, the inherent gain of our method grows faster than the loss caused by the POI difference. This indirectly demonstrates the generalization ability of our method across POIs.
5.6. Generalization for Different AGV Arrival Processes
In an actual ACT, the number of AGVs assigned to each loading/unloading operation varies, and the relative positions of the blocks with respect to the vessel also differ, so the AGV arrival process is variable. In this article, the AGV arrival process is abstracted as a Poisson process, and its two parameters regulate the arrival concentration of AGVs carrying import containers and of empty AGVs. To validate the generalization performance on different AGV arrival processes, the policy learned on the C40 training instances (parameters 1/26 and 1/30) is employed to solve problems with the import-AGV parameter set to 1/22, 1/24, 1/28 and 1/30 and the empty-AGV parameter set to 1/26, 1/28, 1/32 and 1/34.
The scheduling performance of SST and the DRL-based generalization models for different import-AGV arrival parameters is exhibited in Table 6. Although the AM-based generalization model fails to exceed the best scheduling rule, SST, the generalization model based on LICA-DMA is noticeably superior to all baselines.
The improvement curves against the best scheduling rule for the varying import-AGV arrival parameters are depicted in Figure 6 (bottom). As the curves show, the improvement increases with the arrival concentration, presumably because the methods' inherent gains grow faster than the losses caused by the parameter difference; this indirectly demonstrates the generalization ability of the DRL methods. However, the generalization of the AM-based model is of little practical value owing to its unsatisfactory scheduling results. As the LICA-DMA curve shows, the generalized model remarkably outperforms SST, with improvements of 5.19%, 8.16%, 11.23%, 12.69% and 13.04% for the various settings, respectively.
From the characteristics discussed in Section 3, we can infer that the effect of the empty-AGV arrival parameter on congestion is very limited. In the corresponding generalization experiment, the scheduling differences it causes are extremely small for every method, so the details are omitted here owing to limited space.
Remark 4. Benefiting from LICA-DMA, the agent automatically eliminates the information interference of ineligible containers and complements the container information with congestion-related information. Compared with existing methods, our method offers superior scheduling performance and robust generalization capability, which is promising for practical applications.