1. Introduction
Urban metro serves as a sustainable transportation mode: it provides a reliable, high-capacity service to commuters and has relatively higher energy efficiency than other transportation modes. To achieve the desired level of customer service while minimizing operating costs, various operating issues need to be considered in the planning process of urban metro. As many different decisions and decision levels are involved in the planning process, sequential decision-making strategies are typically adopted instead of considering these decisions simultaneously. These strategies commonly consist of the line planning, timetabling, rolling stock circulation, and crew scheduling stages [1,2].
In the line planning stage, we determine the operating frequencies of trains and the set of their stopping stations to satisfy travel demand in a given railway network. The decisions for the timetabling stage include the arrival and departure times of trains at each station at which they stop. Rolling stock circulation includes decisions such as train composition (locomotives and carriages), vehicle circulation, and maintenance policies. In the crew scheduling stage, crew duties are allocated to train services while satisfying operational requirements. Note that the decisions for a particular stage become part of the input for the subsequent stages.
In this paper, we focus on train timetabling for metro, assuming that the decisions in the line planning stage are given in advance. More specifically, we are interested in timetabling for Seoul Metro's Line 9 in South Korea. A distinguishing feature of Line 9 is that there exist two types of trains: local trains and express trains. To provide a regular service for passengers, local trains stop at every station in the line. When trains stop at all stations, however, the travel times of passengers, in particular of long-distance commuters, increase. Therefore, another type of train, the express train, which stops only at a predefined set of stations, is introduced. When these two types of trains run simultaneously in the line, it is important to allow express trains to pass local trains. However, Line 9 consists of a single track for each direction between two consecutive stations, and hence additional infrastructure is required at some stations to make overtaking possible. In Line 9, there is a set of stations, called overtaking stations, at which express trains are allowed to pass local trains on another track.
In this paper, we assume that the departure sequence of trains at their common initial station is given as input. To solve the timetabling problem, we decompose the problem into several smaller problems involving fewer trains and obtain the timetable for all trains by combining the results of the subproblems. The decomposition is based on the departure sequence of trains at the initial station, and each subproblem considers a single express train and its preceding local trains. Assuming that overtaking between trains of the same type is not allowed, the result of one subproblem is independent of those of the other subproblems. Thus, our problem is reduced to a timetabling problem involving local trains and a single express train in which the express train is the last train departing from the initial station.
One promising approach to solve the reduced timetabling problem is to construct a mixed integer programming (MIP) model for the problem and solve it using commercial solvers such as Gurobi or CPLEX. In this paper, instead, we consider a reinforcement learning approach in which an optimal policy is learned. Note that, due to the small size of the reduced problem, MIP solvers can give optimal solutions in reasonable time. However, MIP solvers rely on the mathematical model of the problem, and it is not easy to incorporate complex objectives and/or constraints which cannot be represented as linear functions of the decision variables. Reinforcement learning approaches, on the other hand, can incorporate complex objectives and/or constraints in the form of reward functions, and therefore have wider applicability for diverse situations (e.g., considering energy consumption or passenger crowdedness).
This paper is organized as follows. In Section 2, we introduce the relevant literature, including recent applications of reinforcement learning to train scheduling. A formal definition of our problem and its mathematical model are presented in Section 3. The reinforcement learning framework for our problem is introduced in Section 4, and its performance is demonstrated in Section 5. Finally, Section 6 presents the concluding remarks and possible directions for future research.
2. Literature Review
Train scheduling in metro systems has been studied for a long time. These studies can be broadly categorized into macroscopic and microscopic levels. The former considers the arrival and departure times of trains at stations, ignoring detailed information such as signaling systems and routing within stations [3]. On the contrary, at the microscopic level, more detailed information such as routing within a station is taken into account [4]. Another categorization can be made with respect to the variables representing the times (arrival/departure times of trains). In discrete-time train scheduling, times are discretized and train schedules are defined on a time-space network whose nodes represent the locations of a train at specified times [5,6]. In continuous-time train scheduling, however, continuous variables are used to represent the arrival and departure times of trains [7,8].
Due to the intractability of the problem, both exact algorithms and heuristics have been proposed to solve the train scheduling problem. Many of these approaches rely on mixed integer linear/nonlinear programming formulations. Some of them are solved via efficient enumeration schemes such as branch-and-bound and branch-and-cut algorithms [8,9]. Various metaheuristics have also been proposed. For example, Dundar and Sahin [10] and Niu and Zhou [11] proposed genetic algorithms to solve the train scheduling problem. In particular, Dundar and Sahin [10] combine an artificial neural network (ANN) with the genetic algorithm so that the ANN is trained to mimic the decision behavior of train dispatchers.
As a relatively new approach, approximate dynamic programming (ADP), or reinforcement learning (RL), has also been used to solve train scheduling problems. This approach relies on formulating the problem as a Markov decision process (MDP). To obtain an MDP formulation, it is necessary to recast the problem as a multi-stage decision process. There are two types of train scheduling problems to which ADP or RL approaches have been applied: the timetabling problem and the rescheduling problem. Essentially, both problems aim to decide the optimal values of the same decision variables, such as the arrival/departure times of trains at stations. However, the two problems lie in different stages of the planning process. Timetabling is solved in the basic planning process introduced in Section 1. On the other hand, the rescheduling problem arises when the planned schedule resulting from the timetabling problem becomes infeasible due to unplanned train delays or infrastructure failures. Thus, the two problems can have different objectives and requirements for solution approaches.
Khadilkar [12] and Liu, Li, Yang, and Yin [13] consider timetabling problems at a macroscopic level. Khadilkar [12] considers the timetabling problem for bidirectional railway lines, and therefore track allocations need to be considered in addition to the arrival/departure times of trains at stations; the goal is to minimize the total priority-weighted delay. Liu et al. [13] consider the timetabling problem for energy-efficient management of subway trains, simultaneously taking into account train headways for safety, train passenger loads, and the energy consumption along the subway line.
In the rescheduling problem, both macroscopic [14,15] and microscopic [16,17] approaches exist. Jiang, Gu, Fan, Liu, and Zhu [14] and Yin, Tang, Yang, Gao, and Ran [15] minimize the inconvenience of passengers due to train delays, measured by the waiting times of passengers at stations. In the case of Yin et al. [15], minimizing the energy consumption is additionally considered along with the waiting times of passengers. On the other hand, Ghasempour and Heydecker [16] and Semrov, Marsetic, Zura, Todorovski, and Srdic [17] try to minimize the train delays, defined as the difference between the planned arrival/departure times of trains and the rescheduled arrival/departure times.
In the scheduling problem considered in this paper, we are only interested in the arrival and departure times of trains, and therefore we consider a train scheduling problem at a macroscopic level. The objective is to find a schedule which is as close as possible to some reference schedule. This objective is similar to the one used by Ghasempour and Heydecker [16] and Semrov et al. [17], but, as explained later, the reference schedule is not a feasible train schedule. Our approach is based on an MDP formulation which is solved via the deep Q-learning technique (DQN [18]), similar to Jiang et al. [14]. As explained in Section 4.1, however, our MDP formulation is different from those proposed in the previous literature. More precisely, our MDP formulation has different sets of states and actions. In particular, in our MDP formulation the schedule of only one train is directly handled through the choice of actions, and the schedules of the remaining trains are indirectly changed through the transition function. This contrasts with previous works in which the schedules of all trains are sequentially handled by the choice of actions. Another distinction of our formulation is the use of a linear programming (LP) model for the transition function. In our formulation, the choice of action corresponds to the determination of the integer variables in the mixed integer programming (MIP) formulation of the problem. Thus, the resulting change of schedules can be obtained by solving an LP model.
3. Problem Description
We consider a train scheduling problem on a single-track unidirectional railway. Let the stations be numbered 1 through S; all trains are assumed to begin their trips at station 1 and end at station S. There are two types of trains, local trains and express trains, which differ in their stopping patterns. Local trains stop at every station, while express trains stop only at a predetermined set of stations, and we assume that all express trains have the same set of stopping stations. If there are only local trains, no overtaking occurs. Otherwise, an express train can overtake its preceding local trains at a predefined set of stations, called overtaking stations, at which a sidetrack is available.
We consider the following operational environment.
- (E1)
There exists a safety headway, say h. Therefore, a train can arrive at (depart from) a station only h or more time units after its immediate predecessor arrives at (departs from) that station.
- (E2)
For two consecutive stations s and s+1, where 1 ≤ s < S, there are minimum transit times between s and s+1 for local and express trains, respectively.
- (E3)
If a train stops at a station, it must stay there for at least d time units.
- (E4)
Overtaking occurs only between local and express trains; overtaking between trains of the same type is not allowed.
We assume that the departing sequence of trains at station 1 is given, so that the train departing gth from the initial station is known for every g and, conversely, the departing order of each train at the initial station is known. We also assume that each train i has an earliest departure time after which it can depart from station 1. Consider the departure and arrival times obtained by scheduling each train separately, ignoring the other trains. This schedule is possibly infeasible, as it can violate (E1) and (E4). However, it is the most desirable one and is easily obtainable, and therefore we regard it as our target schedule; that is, we want to find a feasible schedule which is as close as possible to the target schedule.
4. The Proposed Method
The basic idea of the proposed method is to decompose a given instance of the scheduling problem and then solve each subproblem separately. The solution of the original problem is obtained simply by merging the solutions of the subproblems.
In a typical daily metro schedule, a number of local trains are arranged between two consecutive express trains. For example, in Seoul Metro, 2–3 local trains are serviced between two express trains at peak times, while there are 4–5 local trains at non-peak times. The particular pattern of trains is described by the sequence of local and express trains; for example, a service pattern of 7 local trains and 3 express trains can be written as a sequence of the symbols L and E, which denote local and express trains, respectively.
The decomposition is based on the given service pattern of the trains. Each subproblem consists of a single express train and its preceding local trains. Thus, for example, a service pattern with three express trains yields three subproblems, each containing one express train and the local trains departing between it and the previous express train. According to the service pattern, the subproblems can be ordered. Let us consider the ith subproblem. By the operational environment (E4), the local trains in the (i+1)th subproblem cannot overtake the last local train in the ith subproblem. Moreover, the express train in the ith subproblem usually moves faster than any local train, so the first local train in the (i+1)th subproblem also cannot overtake the express train in the ith subproblem. Thus, if we assume that the express train in the (i+1)th subproblem does not overtake any local trains in the ith subproblem, the subproblems can be treated as independent of each other. Therefore, in the remaining part of this section, we focus on the method to solve a single subproblem; an illustrative sketch of the decomposition is given below.
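To make the decomposition concrete, the following Python sketch splits a service pattern, written as a string of L and E symbols, into subproblems consisting of one express train and its preceding local trains. The function name and the example pattern are ours, chosen only for illustration.

```python
def decompose_pattern(pattern):
    """Split a service pattern such as 'LLELLLE' into subproblems.

    Each subproblem is a list of train indices (0-based, in departure
    order at the initial station) ending with a single express train.
    """
    subproblems, current = [], []
    for idx, train_type in enumerate(pattern):
        current.append(idx)
        if train_type == 'E':          # an express train closes a subproblem
            subproblems.append(current)
            current = []
    if current:                        # trailing local trains, if any
        subproblems.append(current)
    return subproblems


# Example: a hypothetical pattern with 2-3 local trains between express trains
print(decompose_pattern('LLELLLELLE'))
# [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
```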
4.1. MDP Formulation
The subproblem consisting of an express train and its preceding local trains is essentially the same as the original problem but with a smaller number of trains. Thus, it has the same MIP formulation as the original problem, involving fewer variables and constraints. In particular, integer variables appear only for the ordering between the single express train and the local trains at the overtaking stations. A key observation is that if we specify the overtaking stations at which the express train overtakes its immediate predecessor, we can determine the values of all integer variables in the MIP formulation of the subproblem. In addition, an optimal solution with a corresponding set O of chosen overtaking stations can be represented as an increasing sequence of sets of overtaking stations, starting from the empty set and ending with O, and the schedule corresponding to each subset in the sequence can be obtained by solving the LP problem in which all integer variables are fixed accordingly.
Based on these observations, we reformulate the subproblem as a sequential decision-making problem such that at each stage we determine the next overtaking station at which the express train overtakes its immediate predecessor. This sequential decision-making problem can be modeled as a Markov decision process (MDP) in which the state space, the action space, the reward, and the transition function are defined as follows (also represented in Figure 1).
State space: A state represents the schedule of the trains, that is, the arrival and departure times of all trains at every station. Thus, a state can be represented by a vector of these arrival and departure times.
Action space: At each time a decision is made, that is, at each state, the number of possible actions is equal to the number of overtaking stations plus 1. For each action corresponding to an overtaking station, a new state is determined by the transition function introduced below; these actions are called overtaking actions. A single additional action is reserved for stopping further exploration of states (schedules); it is called the finishing action.
Transition function: For each overtaking action, a new overtaking station is added to the set of overtaking stations used by the current schedule. There are two cases: (1) the chosen overtaking station has either already been chosen previously or has already been passed in the current schedule, or (2) a new feasible schedule is generated using the resulting set of overtaking stations. In case (1), the current schedule is not changed, and actions corresponding to this case are called infeasible actions; the other actions are called feasible actions. For each feasible action, a new set of overtaking stations is well defined and the values of the integer decision variables in the MIP formulation of the subproblem can be fixed. Thus, we obtain a linear programming (LP) model for the subproblem, which is efficiently solvable, and an optimal solution of this LP provides a new schedule of trains, i.e., a new state. When the finishing action is chosen, the agent stops exploring and declares the current state (schedule) to be the final one.
Reward: For any state s, a feasible action generates a reward of zero, while an infeasible action gives a small negative reward to the agent. For the finishing action, the difference in quality between the initial schedule and the current schedule is given as the reward. Thus, if the final schedule is better than the initial schedule, a positive reward is received, and its magnitude represents the level of improvement.
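As a minimal sketch of how these components interact, the following Python fragment outlines one transition. Here, solve_lp_with_overtakings, already_passed, and schedule_cost are hypothetical helper functions (the first fixes the integer variables implied by the chosen overtaking stations and solves the resulting LP), and the penalty value is an assumed placeholder rather than a value from the paper.

```python
FINISH = "finish"
INFEASIBLE_PENALTY = -0.1   # small negative reward; the exact value is an assumption

def step(state, action, target_schedule, initial_cost):
    """One MDP transition. A state is a (schedule, chosen_overtakings) pair;
    an action is either an overtaking station or the finishing action."""
    schedule, chosen = state
    if action == FINISH:
        # reward = quality difference between the initial and the final schedule
        reward = initial_cost - schedule_cost(schedule, target_schedule)
        return state, reward, True
    if action in chosen or already_passed(schedule, action):
        # infeasible action: the schedule stays the same, small negative reward
        return state, INFEASIBLE_PENALTY, False
    chosen = chosen | {action}
    # feasible action: fix the integer variables implied by `chosen`, re-solve the LP
    schedule = solve_lp_with_overtakings(chosen, target_schedule)
    return (schedule, chosen), 0.0, False
```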
Figure 2 illustrates the process of the agent in the proposed MDP framework.
Figure 2 describes four steps of the agent's decision-making. Step 1 shows the initial schedule of four trains, in which the first three trains (blue, yellow, and green lines) are local trains and the last train (red line) is an express train. According to the overtaking station chosen by the agent, the express train overtakes its preceding local train, as shown in Step 2. Following the decisions shown in Steps 3 and 4, the agent returns the schedule in Step 4 as the result of its finishing action.
4.2. Deep Q-Learning
Each instance of the subproblem is defined by specifying the number of local trains preceding the express train and the target schedule of the trains. In this paper, instead of solving each instance from scratch whenever it is necessary, we learn an agent, or a policy, which can make near-optimal decisions for all instances. There have been several methods to learn such an agent (see, e.g., [19]). Among them, we use a deep neural network, i.e., a deep Q-network (DQN) [18], with which we estimate the Q-values defined as

$$Q^*(s,a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\; a_t = a,\; \pi \right],$$

which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time step $t$, achievable after making an observation $s$ and choosing an action $a$ when a policy $\pi$ is used thereafter. The idea is to represent the Q-value function by a recursive equation, known as the Bellman equation, and to iteratively update the Q-values with this equation, i.e.,

$$Q_{i+1}(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q_i(s',a') \mid s, a \right].$$

The Q-value function is parameterized by $\theta$ in the form of a deep neural network, $Q(s,a;\theta)$. The loss function which guides the update of $\theta$ is

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( y_i - Q(s,a;\theta_i) \right)^2 \right], \qquad y_i = r + \gamma \max_{a'} Q(s',a';\theta_{i-1}),$$

where $y_i$ is the target value at iteration $i$.
To train the neural network, random scenarios, i.e., instances of the subproblem, are generated. For each scenario, the agent chooses an action following the ε-greedy policy: most of the time the agent chooses the action with the largest Q-value, but there is a small probability ε of choosing a random action. Then, the obtained experiences $(s_t, a_t, r_t, s_{t+1})$ for each scenario and time step $t$ are used to define the loss function $L_i(\theta_i)$, which is then minimized with respect to $\theta_i$. The learned Q-network can be used to make a decision by choosing the action which attains the largest Q-value.
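As a small illustration, ε-greedy action selection could be implemented as follows, assuming the Q-network is a Keras model that maps a state vector to the 8 Q-values; the function name is ours, not from the paper.

```python
import numpy as np

def epsilon_greedy_action(q_network, state, epsilon, n_actions=8):
    """With probability epsilon choose a random action; otherwise choose
    the action with the largest predicted Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = q_network.predict(state[np.newaxis, :], verbose=0)
    return int(np.argmax(q_values[0]))
```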
We also adopt two strategies proposed in [18] to prevent a possible divergence of learning. The first strategy is the experience replay memory, which stores the experiences $(s, a, r, s')$ collected during training; whenever an update of $\theta$ is performed, several experiences are randomly chosen from the memory to form the loss function. In the second strategy, we maintain two neural networks, called the target network and the behavior network. The target network, parameterized by $\theta^{-}$, is used to compute the target value $y_i$, and the behavior network is used to explore the scenarios. Thus, at each iteration $i$, the loss function is defined as

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta_i) \right)^2 \right],$$

where the expectation is taken over experiences drawn at random from the replay memory $D$. The parameters $\theta^{-}$ of the target network are updated to the parameters $\theta$ of the behavior network according to a predefined schedule.
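A sketch of one update step implementing this loss with Keras models is shown below; behavior_net and target_net are assumed to be models with identical architectures compiled with a mean-squared-error loss, and the discount factor value is an assumption, since the paper does not report it.

```python
import numpy as np

def dqn_update(behavior_net, target_net, minibatch, gamma=0.99):
    """One training step on a mini-batch of (state, action, reward,
    next_state, done) tuples, using the target network for the targets."""
    states = np.array([e[0] for e in minibatch])
    next_states = np.array([e[3] for e in minibatch])

    q_values = behavior_net.predict(states, verbose=0)    # Q(s, a; theta)
    q_next = target_net.predict(next_states, verbose=0)   # Q(s', a'; theta^-)

    for k, (_, action, reward, _, done) in enumerate(minibatch):
        target = reward if done else reward + gamma * np.max(q_next[k])
        q_values[k, action] = target   # only the taken action gets a new target

    # minimizing the squared error corresponds to L_i(theta_i) above;
    # 16 epochs per mini-batch, as described in Section 5.3
    behavior_net.fit(states, q_values, epochs=16, verbose=0)
```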
5. Experiments
5.1. Instances
In our experiments, we generate instances for the subproblem consisting of a single express train and multiple local trains preceding it. These trains are assumed to run on Seoul Metro's Line 9 in South Korea. There are 30 stations, from station 1 to station 30, and all express trains stop at 12 of them: stations 1, 6, 9, 12, 14, 16, 19, 22, 24, 26, 28, and 29. Overtaking between local trains and the express train can occur at stations 4, 6, 11, 15, 19, 23, and 27. Therefore, the number of actions is equal to 8, i.e., the number of overtaking stations plus 1. The dwell time of each train at a stopping station is 30 s, and the safety headway is 60 s. The minimum transit times of each type of train between two consecutive stations are given in Table 1.
For given departure times of the trains at the initial station (station 1), the target schedule consists of the optimal schedule of each train when the other trains are ignored; that is, each train moves along the line according to its minimum transit times between stations and stays for 30 s at each of its stopping stations. The aggregation of such single-train schedules possibly results in an infeasible schedule, but it is sufficient to serve as the target schedule; other choices of target schedules are also possible. On the other hand, for a given instance, the initial schedule is obtained by solving the MIP in which the integer variables are fixed by assuming that the order of the trains at every station is the same as their order at the initial station.
We assume that the subproblem consists of at most 4 trains, and therefore the number of local trains ranges from 1 to 3. For a given number of local trains and a single express train, the intervals of departure times at the initial station are randomly chosen between 120 s and 600 s.
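A sketch of how a random instance and its target schedule could be generated under these assumptions is given below; the transit-time arguments are placeholders for the values in Table 1, and the function and variable names are ours.

```python
import random

N_STATIONS = 30
DWELL = 30                                   # dwell time at a stopping station (s)
EXPRESS_STOPS = {1, 6, 9, 12, 14, 16, 19, 22, 24, 26, 28, 29}
LOCAL_STOPS = set(range(1, N_STATIONS + 1))  # local trains stop everywhere

def single_train_schedule(depart_time, transit, stops):
    """Schedule one train in isolation: minimum transit times between
    consecutive stations plus a 30 s dwell at every stopping station."""
    arrival, departure = {}, {1: depart_time}
    for s in range(1, N_STATIONS):
        arrival[s + 1] = departure[s] + transit[s]          # transit[s]: s -> s+1
        departure[s + 1] = arrival[s + 1] + (DWELL if (s + 1) in stops else 0)
    return arrival, departure

def random_instance(n_locals, local_transit, express_transit):
    """n_locals local trains followed by one express train; departure
    intervals at station 1 are drawn uniformly between 120 s and 600 s."""
    depart, target = 0, []
    for i in range(n_locals + 1):
        is_express = (i == n_locals)
        target.append(single_train_schedule(
            depart,
            express_transit if is_express else local_transit,
            EXPRESS_STOPS if is_express else LOCAL_STOPS))
        depart += random.randint(120, 600)
    return target            # the (possibly infeasible) target schedule
```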
5.2. Q-Learning
To train the agent, represented by a single neural network explained in the next subsection, we randomly generate an instance and let the agent solve it; this process is called an episode. Random instances with a particular number of trains vary according to the departure intervals at the initial station. The experiences of the agent, consisting of the state, the action, the reward, and the next state, are used to train the neural network. The size of the experience memory is set to 200,000. To accumulate enough experiences, the update of the behavior network is postponed until 4000 episodes have been explored. Once the updates begin, we randomly choose 8 experiences from the memory, and the weights of the neural network are updated whenever the agent takes an action. In addition, the parameter ε, initially set to 1, is diminished by a constant factor after each update of the weights of the neural network. When ε falls below a predefined threshold, the entire training process ends. The target network is updated to the behavior network whenever an episode ends.
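Putting these pieces together, a skeleton of this training loop might look as follows; env is assumed to expose reset() and step() returning vector-encoded states, epsilon_greedy_action and dqn_update refer to the sketches given earlier, and the decay factor and stopping threshold are placeholders since their exact values are not reproduced here.

```python
import random
from collections import deque

MEMORY_SIZE = 200_000
WARMUP_EPISODES = 4_000
BATCH_SIZE = 8
EPS_DECAY = 0.9999      # placeholder for the decay factor
EPS_MIN = 0.01          # placeholder for the stopping threshold

def train(env, behavior_net, target_net):
    memory = deque(maxlen=MEMORY_SIZE)
    epsilon, episode = 1.0, 0
    while epsilon > EPS_MIN:
        state, done = env.reset(), False               # a random instance (episode)
        while not done:
            action = epsilon_greedy_action(behavior_net, state, epsilon)
            next_state, reward, done = env.step(action)
            memory.append((state, action, reward, next_state, done))
            if episode >= WARMUP_EPISODES:             # updates start after warm-up
                dqn_update(behavior_net, target_net,
                           random.sample(memory, BATCH_SIZE))
                epsilon *= EPS_DECAY                   # decay after every update
            state = next_state
        target_net.set_weights(behavior_net.get_weights())   # once per episode
        episode += 1
```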
As explained earlier, for a given state s and the action a taken by the agent, the new state is obtained by solving the corresponding LP, which is solved with PuLP. The optimal objective value is the total absolute difference between the target schedule and the current schedule (the optimal solution of the LP). Thus, the performance measure used in our experiments is the objective value of the LP defined by the final state, i.e., the final schedule.
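A hedged sketch of this LP using PuLP is given below; the absolute deviation from the target schedule is linearized with auxiliary variables, only the headway, dwell, and transit-time constraints of Section 3 are shown for a fixed train order at each station, and all names are illustrative rather than taken from our implementation.

```python
import pulp

def solve_schedule_lp(trains, stations, target, transit, stops_of, order_at,
                      dwell=30, headway=60):
    """LP for the subproblem once the train order at every station is fixed.

    target[i, s, 'arr'/'dep'] : target arrival/departure time of train i at s
    transit[i][s]             : minimum transit time of train i from s to s+1
    stops_of[i]               : set of stations at which train i stops
    order_at[s]               : list of trains in their fixed order at station s
    """
    prob = pulp.LpProblem("subproblem_lp", pulp.LpMinimize)
    arr = pulp.LpVariable.dicts("arr", (trains, stations), lowBound=0)
    dep = pulp.LpVariable.dicts("dep", (trains, stations), lowBound=0)
    dev = pulp.LpVariable.dicts("dev", (trains, stations, ["arr", "dep"]), lowBound=0)

    # objective: total absolute deviation from the target schedule
    prob += pulp.lpSum(dev[i][s][k] for i in trains for s in stations
                       for k in ("arr", "dep"))

    for i in trains:
        for s in stations:
            # linearization of |time - target| <= dev
            prob += arr[i][s] - target[i, s, "arr"] <= dev[i][s]["arr"]
            prob += target[i, s, "arr"] - arr[i][s] <= dev[i][s]["arr"]
            prob += dep[i][s] - target[i, s, "dep"] <= dev[i][s]["dep"]
            prob += target[i, s, "dep"] - dep[i][s] <= dev[i][s]["dep"]
            # (E3) minimum dwell time at stopping stations
            prob += dep[i][s] >= arr[i][s] + (dwell if s in stops_of[i] else 0)
            # (E2) minimum transit time to the next station
            if s + 1 in stations:
                prob += arr[i][s + 1] >= dep[i][s] + transit[i][s]
    # (E1) safety headway between consecutive trains at each station
    for s in stations:
        for prev, nxt in zip(order_at[s], order_at[s][1:]):
            prob += arr[nxt][s] >= arr[prev][s] + headway
            prob += dep[nxt][s] >= dep[prev][s] + headway

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective)
```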
5.3. Deep Q-Network
The neural network representing the agent consists of 4 hidden layers with 512, 256, 128, and 64 units, respectively. All layers are fully connected and use the rectified linear activation. The output layer has as many units as the number of actions, i.e., 8, and uses a linear activation. The weights of the neural network are randomly initialized from a normal distribution with mean zero and standard deviation 1.
The input of the network is the state, i.e., the arrival and departure times of the trains at all stations, so its dimension is proportional to the number of trains n. As our instances contain up to 4 trains, we fix n to 4; whenever an instance has fewer than 4 trains, the remaining entries of the input are filled with zeros.
As stated earlier, 8 experiences randomly chosen from the experience memory constitute a mini-batch of size 8. We use Adam [20] as the training algorithm for the neural network, and 16 epochs are repeated with the same mini-batch.
The neural network and its training algorithm are implemented with the Keras library with a TensorFlow backend. All experiments are performed on a single NVIDIA Quadro P500.
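A sketch of this architecture in Keras is shown below; the normal initializer with standard deviation 1 follows the description above, while the learning rate is an assumed placeholder since its exact value is not reproduced here.

```python
from tensorflow import keras

def build_q_network(input_dim, n_actions=8, learning_rate=1e-4):
    """Fully connected Q-network: 512-256-128-64 ReLU hidden layers and a
    linear output layer with one unit per action."""
    init = keras.initializers.RandomNormal(mean=0.0, stddev=1.0)
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(512, activation="relu", kernel_initializer=init),
        keras.layers.Dense(256, activation="relu", kernel_initializer=init),
        keras.layers.Dense(128, activation="relu", kernel_initializer=init),
        keras.layers.Dense(64, activation="relu", kernel_initializer=init),
        keras.layers.Dense(n_actions, activation="linear",
                           kernel_initializer=init),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model
```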
5.4. Results
We train the neural network as described above and first evaluate the learned network on 50 random instances consisting of only two trains: one local train and one express train. Table 2 demonstrates the result. The columns of the table indicate the instance number (#), the optimal objective value (MIP), the objective value from the learned agent (DQN), and the optimality gap (Gap), respectively. The optimality gap is computed as (DQN − MIP)/MIP × 100%. As observed in Table 2, the learned policy successfully finds optimal schedules for all 50 instances.
Table 3 demonstrates the result tested on 50 random instances consisting of 3 trains, that is, 2 local trains. The structure of Table 3 is the same as that of Table 2.
Among the 50 instances, the learned policy succeeds in finding the optimal solutions for 36 instances. The gaps for the instances for which the learned policy fails to give optimal solutions are relatively low (within 10%), except for instances #1, #23, #29, and #44. Even for these instances, however, the schedules obtained by the learned policy are largely improved over their initial schedules. Furthermore, the sequences of overtaking stations determined by the optimal schedule and by the schedule of the learned policy are quite similar.
Finally, we tested the learned policy on 50 random instances consisting of 4 trains, that is, three local trains. The structure of Table 4 is again the same as that of the previous two tables. Table 4 shows that the performance of the learned policy is considerably poorer than in the previous two experiments, with a much larger average gap over the 50 instances. Four-train instances are more difficult than the 2- and 3-train instances. In particular, during the training phase of the neural network, the agent gains little experience with cases involving two or more overtaking stations. Indeed, by investigating the experience memory which stores the experiences used during training, we found that the experiences are heavily concentrated on cases with 0 or 1 overtaking stations, while cases with 2 or 3 overtaking stations are rarely encountered. Therefore, there exists a severe imbalance of experiences when a simple ε-greedy exploration strategy is adopted. We conjecture that this phenomenon can be avoided by designing more sophisticated exploration strategies for training.
6. Conclusions
In this paper, we consider a train scheduling problem in which both local and express trains are scheduled. For a given target schedule on a unidirectional single-track railway, we decompose the problem into separate subproblems, in each of which a single express train and its preceding local trains are to be scheduled. We formulate each subproblem as a Markov decision process and solve it with a deep reinforcement learning technique.
We demonstrated the performance of the proposed method over 150 instances, which are divided into three groups with respect to the number of local trains. For the instances with a small number of trains (2 and 3), the proposed method is comparable with exact algorithms (e.g., branch-and-bound or branch-and-cut methods). However, for larger instances, it still requires improvement to obtain better schedules.
An advantage of the proposed method is that the learned policy can be used for any instance regardless of the number of local trains. Another advantage is its possible extension to more complex objective functions (e.g., energy consumption, passenger crowdedness, etc.) through alternative forms of rewards. However, this possibility is still conceptual, and empirical validation is needed as part of future studies.
The obvious next step for future research is to devise an effective exploration strategy by which the agent can approximate the Q-values for diverse states. Furthermore, we have not extensively experimented with hyperparameters such as the architecture of the neural network and the parameters related to its training (e.g., the learning rate). Therefore, more careful investigations, for example via Bayesian optimization, are required.