Automatic Discovery of Railway Train Driving Modes Using Unsupervised Deep Learning
Abstract
1. Introduction
1.1. Background
- Large number of modes. Compared with existing transportation mode detection, MDRTs describe differences at the finer level of microscopic driving operations [2,14]. Thus, the number of modes in MDRTs is much greater than in existing mode-detection tasks. Figure 1 illustrates some of the modes corresponding to two running plans (i.e., stop at Station1-stop at Station2 and stop at Station1-pass Station2) as an example. Modes 0–2 stop at both Station1 and Station2, whereas modes 3–5 stop only at Station1 and pass Station2. The modes become more complicated when the plans are expanded (e.g., pass Station1-stop at Station2 and pass Station1-pass Station2). Furthermore, a railway consists of a large number of stations and sections, and the MDRTs found in different sections differ considerably. This diversity further increases the need for unsupervised learning.
- Complex structures in the modes. Operators have previously used running time to distinguish MDRTs, but this method is inefficient: many modes are too close in running time to be distinguished [2]. Therefore, a method that fully considers the internal structure of the driving process is required. However, the inner structure of these modes is complex, especially when multiple phases exist. For example, both mode 3 and mode 4 in Figure 1 contain two acceleration phases at unfixed locations, which are difficult to classify manually. The difficulty that multiple phases pose for MDRT analysis has also been noted in other studies [6,7,15]. There is still no effective way to automatically discover a large number of modes from such complex-structure data.
- Multi-profile data. The integrated trajectory data are aggregated time-series data that contain multiple profiles obtained from the built-in locomotive control and information systems. We call them integrated trajectory data because all of these profiles rely on spatio-temporal GPS trajectories for storage.
- Lack of labeled data. The information system used by railways was originally designed as a safety-guard system that can assist drivers or dispatchers in avoiding over-speed or violation of signals. Neither the interface nor the built-in function modules are designed for automated analysis. Therefore, there are no ready-made automated analysis results. In addition, as sequence data with complex structures as well as multiple profiles, integrated railway trajectory data are difficult to label manually. On the one hand, only a small number of personnel with certain experience can recognize the modes existing in the data. On the other hand, multi-profile integrated trajectory data can hardly be analyzed directly by humans, which makes manual labeling more difficult. Given such limited labeled data, supervised models ultimately suffer from overfitting when adapted to very flexible deep neural networks.
- Lack of benchmark. Research on MDRTs is still in its infancy; thus, there is no well-accepted benchmark dataset, which hinders the evaluation of unsupervised learning models.
- Inaccurate prior knowledge. Even experienced on-site employees may have an incorrect or inadequate understanding of the mode distribution.
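The multi-profile integrated trajectory data described above can be pictured as GPS-anchored samples carrying several synchronized channels. The sketch below is only an illustration; the field names are hypothetical, not the actual channel names of the locomotive systems:

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPoint:
    """One sample of integrated trajectory data: a spatio-temporal GPS
    anchor plus the profile values recorded at that point (field names
    are illustrative only)."""
    timestamp: float   # seconds since departure
    mileage_km: float  # position along the section
    speed_kmh: float   # speed profile value
    # Further profiles from the control and information systems would
    # be added as extra fields in the same way.

# An integrated trajectory is simply an ordered sequence of such points.
trip = [
    TrajectoryPoint(0.0, 0.000, 0.0),
    TrajectoryPoint(30.0, 0.210, 45.5),
]
```

Because every profile shares the same GPS anchor, the whole trip can be stored and resampled as one multi-channel time series.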
- We propose five unsupervised deep learning models and show that they outperform classical unsupervised learning models (by 27.64% on average) on real and artificial datasets of different scales. We also show that the adversarial autoencoder clustering model performs best.
- We show that integrated trajectory data improve the accuracy of unsupervised learning by approximately 13.78% compared with pure GPS trajectory data.
- We discover the mode distribution in the real dataset and analyze the modes' characteristics in terms of phases.
- We measure the difference between the model-predicted distributions and the distributions labeled by operators, namely, the gap between the unsupervised learning outcomes and the subjective recognition results, using indices with ground-truth labels.
1.2. Related Works
2. Methodologies
2.1. Methodology Set Up
2.2. Unsupervised Deep Learning Models
2.2.1. Parameter Tuning
2.2.2. Adversarial Autoencoder Clustering Model (AAEC)
2.2.3. Deep Embedded Clustering (DEC)
2.2.4. Model Consisting of AAE and KLD-based Cluster (AAEKC)
2.2.5. Model Consisting of SAE and CatGAN (SAECC)
2.2.6. Model Consisting of AAE and CatGAN (AAECC)
2.3. Evaluation Metric
2.3.1. Evaluation Indices Without Ground-Truth Labels
2.3.2. Evaluation Indices with Ground-Truth Labels
2.3.3. Cluster Number Determination and Partial Data Labeling
3. Results and Discussion
3.1. Experimental Settings
3.2. Results and Discussions
3.2.1. Result and Discussion of Step (i)
3.2.2. Result and Discussion of Step (ii)
- (1) The performances of almost all models were relatively high on the balanced datasets (ABSD and ABLD) and relatively low on the unbalanced datasets (AUBSD, AUBLD, and RD9): the index values on the balanced datasets were larger than those on the unbalanced datasets. The imbalance of the data had a greater impact on the classical models than on the proposed deep learning models, as reflected in the corresponding reduction of the indices. For example, the index value of k-means was 2331.6 on the ABSD and 2025.0 on the AUBSD, a reduction of 13.1%, whereas that of AAEC was 2399.9 on the ABSD and 2253.1 on the AUBSD, a decrease of only 6.1%. A similar situation existed for the other indices.
- (2) Analyzing the performance of each model individually, we found that when the dataset was small, the differences between the models were small. As the dataset scale increased, the performances of the models began to decrease. For example, the CatGAN-based models performed well on the small datasets but less so on the large ones. The classical models handled the large datasets poorly; their performances decreased significantly on the AUBLD and ABLD datasets. For example, the index value of SEC was 7.2 on the AUBSD but rose by 56.9% to 11.3 on the AUBLD. In contrast, the deep learning methods, especially DEC and AAEC, still achieved relatively high performances on the large datasets, although they also showed a decline.
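The relative changes quoted above follow directly from the reported index values; a quick arithmetic check, using only the numbers stated in the text:

```python
def pct_change(before, after):
    """Relative change of a cluster-validity index value, in percent."""
    return (after - before) / before * 100.0

# k-means: 2331.6 (ABSD) -> 2025.0 (AUBSD), about a 13.1% reduction
assert round(pct_change(2331.6, 2025.0), 1) == -13.1
# AAEC: 2399.9 (ABSD) -> 2253.1 (AUBSD), about a 6.1% reduction
assert round(pct_change(2399.9, 2253.1), 1) == -6.1
# SEC: 7.2 (AUBSD) -> 11.3 (AUBLD), about a 56.9% increase
assert round(pct_change(7.2, 11.3), 1) == 56.9
```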
3.2.3. Result and Discussion of Step (iii)
3.2.4. Result and Discussion of Step (iv)
3.3. The Relationship between Discovered MDRT and Railway Carrying Capacity
4. Conclusions
Author Contributions
Funding
Conflicts of Interest
Appendix A
AAEC | 0 1 | 0 | 0 | 0 | 0 | 1 | 2
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2) | 1500.0 2 (1500) | 1000.6 (1000) | 2001.8 (2001) | 1243.9 (1243) | 2901.8 (2901) | 1743.0 (1743) | 2981.0 (2981)
 | 3101.6 (3101) | 3890.0 (3890) | 3120.7 (3120) | 3000.0 (3000) | 3081.0 (3081) | 3500.0 (3500) | 3181.0 (3181)
D1 (Layer 1 & 2) | 2001.0 (2001) | 2301.0 (2301) | 3001.0 (3001) | 2201.0 (2201) | 1001.0 (1001) | 2391.0 (2391) | 1031.0 (1031)
 | 3210.0 (3210) | 3110.7 (3110) | 2210.0 (2210) | 3210.0 (3210) | 2210.0 (2210) | 3210.0 (3210) | 2290.2 (2290)
D2 (Layer 1 & 2) | 2108.0 (2108) | 2138.8 (2138) | 3108.0 (3108) | 2148.7 (2148) | 1108.0 (1108) | 2248.7 (2248) | 1208.0 (1208)
 | 3980.8 (3980) | 3080.0 (3080) | 3080.8 (3080) | 3280.8 (3280) | 1980.8 (1980) | 3180.8 (3180) | 1080.8 (1080)
DVI | 0.98 | 1.21 | 1.82 | 1.01 | 1.76 | 1.21 | 1.65
AAEC | 3 1 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2) | 1700.0 2 (1700) | 1921.6 (1921) | 2101.0 (2101) | 2890.0 (2890) | 3000.8 (3000) | 3102.0 (3102) | 2781.0 (2781)
 | 3201.6 (3201) | 3490.0 (3490) | 3120.7 (3120) | 3020.0 (3020) | 3000.3 (3000) | 3500.0 (3500) | 2881.0 (2881)
D1 (Layer 1 & 2) | 2101.0 (2101) | 2801.0 (2801) | 3013.0 (3013) | 3091.2 (3091) | 3000.9 (3000) | 2980.0 (2980) | 3131.0 (3131)
 | 3310.0 (3310) | 3210.7 (3210) | 3810.0 (3810) | 3010.0 (3010) | 3000.1 (3000) | 3010.0 (3010) | 3290.2 (3290)
D2 (Layer 1 & 2) | 2908.2 (2908) | 2038.8 (2038) | 2308.8 (2308) | 2848.7 (2848) | 3000.8 (3000) | 3048.7 (3048) | 3208.6 (3208)
 | 3780.8 (3780) | 3280.0 (3280) | 2080.8 (2080) | 3180.8 (3180) | 3000.5 (3000) | 3280.8 (3280) | 2080.8 (2080)
DVI | 1.90 | 2.39 | 2.01 | 2.89 | 3.01 | 2.90 | 2.65
DEC | 0 1 | 0 | 0 | 0 | 0 | 1 | 2
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2 & 3) | 1000.0 2 (1000) | 1010.6 (1010) | 2001.8 (2001) | 2243.9 (2243) | 1901.8 (1901) | 1843.0 (1843) | 2081.0 (2081)
 | 2101.6 (2101) | 1890.0 (1890) | 2120.7 (2120) | 1090.0 (1090) | 3821.0 (3821) | 3570.0 (3570) | 3081.0 (3081)
 | 1920.9 (1920) | 2018.0 (2018) | 2098.7 (2098) | 1237.9 (1237) | 3089.3 (3089) | 2002.2 (2002) | 1986.6 (1986)
DVI | 1.32 | 1.01 | 0.82 | 1.10 | 1.26 | 1.31 | 0.98
DEC | 3 1 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2 & 3) | 1790.0 2 (1790) | 2021.6 (2021) | 2101.0 (2101) | 1890.0 (1890) | 1301.8 (1301) | 1000.7 (1000) | 1200.0 (1200)
 | 2201.6 (2201) | 2490.0 (2490) | 1820.7 (1820) | 1320.0 (1320) | 1031.0 (1031) | 1000.4 (1000) | 1012.3 (1012)
 | 2087.2 (2087) | 1301.7 (1301) | 2910 (2910) | 2012.3 (2012) | 1532.0 (1532) | 2000.8 (2000) | 1021.4 (1021)
DVI | 1.58 | 1.86 | 2.30 | 2.70 | 2.96 | 3.13 | 2.90
AAEKC | 0 1 | 0 | 0 | 0 | 0 | 1 | 2
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2) | 1200.0 2 (1200) | 1900.6 (1900) | 3401.8 (3401) | 1243.9 (1243) | 2201.8 (2201) | 2043.0 (2043) | 2481.0 (2481)
 | 3001.6 (3001) | 3090.0 (3090) | 3920.7 (3920) | 2000.0 (2000) | 3381.0 (3381) | 3300.0 (3300) | 3187.0 (3187)
D (Layer 1 & 2) | 2001.0 (2001) | 3201.0 (3201) | 3601.0 (3601) | 2101.0 (2101) | 3201.0 (3201) | 3391.0 (3391) | 3131.0 (3131)
 | 2210.0 (2210) | 3010.7 (3010) | 2010.0 (2010) | 3010.0 (3010) | 1210.0 (1210) | 3190.0 (3190) | 3290.2 (3290)
DVI | 1.21 | 1.00 | 1.93 | 1.32 | 1.52 | 2.31 | 2.87
AAEKC | 3 1 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2) | 2900.0 2 (2900) | 3021.6 (3021) | 3000.0 (3000) | 3190.0 (3190) | 3030.8 (3030) | 3182.0 (3182) | 3381.0 (3381)
 | 3001.6 (3001) | 3090.0 (3090) | 3000.3 (3000) | 3002.0 (3002) | 3231.0 (3231) | 3100.0 (3100) | 2981.0 (2981)
D (Layer 1 & 2) | 2808.8 (2808) | 3138.8 (3138) | 3000.7 (3000) | 3108.3 (3108) | 3312.8 (3312) | 3001.7 (3001) | 3020.8 (3020)
 | 3380.8 (3380) | 3080.0 (3080) | 3000.8 (3000) | 3100.2 (3100) | 3187.8 (3187) | 3180.8 (3180) | 3000.8 (3000)
DVI | 2.89 | 2.91 | 3.00 | 2.98 | 2.81 | 2.90 | 2.85
SAECC | 0 1 | 0 | 0 | 0 | 0 | 1 | 2
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2 & 3) | 1100.0 2 (1100) | 1000.6 (1000) | 2901.8 (2901) | 1043.9 (1043) | 2201.8 (2201) | 1503.0 (1503) | 1481.0 (1481)
 | 2000.6 (2000) | 1090.0 (1090) | 2720.7 (2720) | 1700.0 (1700) | 1081.0 (1081) | 2300.0 (2300) | 2187.0 (2187)
 | 2890.7 (2890) | 2970.7 (2970) | 2086.0 (2086) | 2098.6 (2098) | 2365.8 (2365) | 2543.0 (2543) | 2290.0 (2290)
G (Layer 1 & 2) | 501.7 (501) | 590.6 (590) | 890.6 (890) | 1000.6 (1000) | 753.8 (753) | 600.6 (600) | 800.8 (800)
 | 689.6 (689) | 708.7 (708) | 1087.8 (1087) | 1342.6 (1342) | 1006.7 (1006) | 1109.0 (1109) | 1090.0 (1090)
D (Layer 1 & 2 & 3 & 4 & 5) | 1201.0 (1201) | 2901.0 (2901) | 2601.0 (2601) | 2101.0 (2101) | 2801.0 (2801) | 2901.0 (2901) | 2131.0 (2131)
 | 2010.0 (2010) | 2990.7 (2990) | 2010.0 (2010) | 2010.0 (2010) | 1210.0 (1210) | 2990.0 (2990) | 2590.2 (2590)
 | 1021.9 (1021) | 1000.0 (1000) | 2012.7 (2012) | 892.0 (892) | 2000.8 (2000) | 1002.7 (1002) | 842.0 (842)
 | 2001.9 (2001) | 501.8 (501) | 3512.3 (3512) | 722.0 (722) | 3021.0 (3021) | 1730.9 (1730) | 1687.0 (1687)
 | 3019.0 (3019) | 801.9 (801) | 1092.9 (1092) | 523.9 (523) | 2712.0 (2712) | 1420.8 (1420) | 1021.9 (1021)
DVI | 0.65 | 0.89 | 0.93 | 0.82 | 0.70 | 1.01 | 0.98
SAECC | 3 1 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2 & 3) | 1000.0 2 (1000) | 1900.6 (1900) | 1501.8 (1501) | 1243.9 (1243) | 1201.8 (1201) | 1000.0 (1000) | 1001.0 (1001)
 | 2001.6 (2001) | 2890.0 (2890) | 2020.7 (2020) | 2000.0 (2000) | 1381.0 (1381) | 1000.8 (1000) | 1187.0 (1187)
 | 2210.7 (2210) | 2098.0 (2098) | 2346.0 (2346) | 2008.9 (2008) | 1908.8 (1908) | 2000.1 (2000) | 2406.0 (2406)
G (Layer 1 & 2) | 700.6 (700) | 600.0 (600) | 654.8 (654) | 598.0 (598) | 500.6 (500) | 500.8 (500) | 510.7 (510)
 | 976.6 (976) | 987.0 (987) | 1000.0 (1000) | 998.7 (998) | 1020.9 (1020) | 1000.0 (1000) | 1050.4 (1050)
D (Layer 1 & 2 & 3 & 4 & 5) | 1001.0 (1001) | 1201.0 (1201) | 1080.0 (1080) | 1101.0 (1101) | 1301.0 (1301) | 1000.7 (1000) | 1031.0 (1031)
 | 810.0 (810) | 510.7 (510) | 608.0 (608) | 810.0 (810) | 600.0 (600) | 500.6 (500) | 690.2 (690)
 | 1001.9 (1001) | 800.0 (800) | 912.7 (912) | 702.0 (702) | 500.8 (500) | 250.7 (250) | 842.0 (842)
 | 1701.9 (1701) | 401.8 (401) | 512.3 (512) | 682.0 (682) | 521.0 (521) | 250.9 (250) | 1687.0 (1687)
 | 2019.0 (2019) | 701.9 (701) | 902.9 (902) | 623.9 (623) | 512.0 (512) | 250.8 (250) | 1021.9 (1021)
DVI | 1.21 | 1.30 | 1.88 | 2.09 | 2.32 | 2.43 | 2.27
AAECC | 0 1 | 0 | 0 | 0 | 0 | 1 | 2
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2 & 3) | 1000.0 2 (1000) | 1300.6 (1300) | 1901.8 (1901) | 1243.9 (1243) | 2201.8 (2201) | 1203.0 (1203) | 1281.0 (1281)
 | 1000.6 (1000) | 1190.0 (1190) | 2020.7 (2020) | 1000.0 (1000) | 2081.0 (2081) | 1300.0 (1300) | 1807.0 (1807)
 | 2090.7 (2090) | 2070.7 (2070) | 2986.0 (2986) | 1098.6 (1098) | 2065.8 (2065) | 1543.0 (1543) | 2090.0 (2090)
G2 (Layer 1 & 2) | 701.7 (701) | 790.6 (790) | 1090.6 (1090) | 900.6 (900) | 1093.8 (1093) | 680.6 (680) | 700.8 (700)
 | 889.6 (889) | 1008.7 (1008) | 887.8 (887) | 1042.6 (1042) | 806.7 (806) | 1309.0 (1309) | 1190.0 (1190)
D1 (Layer 1 & 2) | 3201.0 (3201) | 2901.0 (2901) | 1601.0 (1601) | 2101.0 (2101) | 2801.0 (2801) | 1901.0 (1901) | 2931.0 (2931)
 | 3010.0 (3010) | 1990.7 (1990) | 3010.0 (3010) | 2010.0 (2010) | 2210.0 (2210) | 1990.0 (1990) | 2990.2 (2990)
D2 (Layer 1 & 2 & 3 & 4 & 5) | 2001.8 (2001) | 3018.2 (3018) | 2123.0 (2123) | 2001.8 (2001) | 1023.6 (1023) | 1203.0 (1203) | 2201.0 (2201)
 | 1082.8 (1082) | 672.9 (672) | 3238.2 (3238) | 1023.8 (1023) | 789.8 (789) | 568.0 (568) | 1023.9 (1023)
 | 1203.8 (1203) | 2312.0 (2312) | 2012.7 (2012) | 788.9 (788) | 1004.9 (1004) | 1003.0 (1003) | 989.0 (989)
 | 2031.8 (2031) | 1003.8 (1003) | 3129.0 (3129) | 1002.4 (1002) | 765.3 (765) | 891.0 (891) | 765.0 (765)
 | 3321.8 (3321) | 1024.8 (1024) | 1292.8 (1292) | 560.7 (560) | 3231.0 (3231) | 2001.9 (2001) | 1652.0 (1652)
DVI | 0.75 | 0.79 | 0.90 | 0.96 | 0.80 | 0.91 | 1.02
AAECC | 3 1 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---
Encoder & decoder (Layer 1 & 2 & 3) | 1100.0 2 (1100) | 1300.6 (1300) | 1201.8 (1201) | 1000.9 (1000) | 1001.8 (1001) | 1030.5 (1030) | 1481.0 (1481)
 | 1000.6 (1000) | 1190.0 (1190) | 1200.7 (1200) | 1000.0 (1000) | 1001.0 (1001) | 1000.1 (1000) | 2287.0 (2287)
 | 2490.7 (2490) | 2070.7 (2070) | 2186.0 (2186) | 2000.6 (2000) | 2015.8 (2015) | 2000.5 (2000) | 2090.0 (2090)
G2 (Layer 1 & 2) | 901.7 (901) | 980.6 (980) | 890.6 (890) | 500.6 (500) | 553.8 (553) | 500.6 (500) | 600.8 (600)
 | 1089.6 (1089) | 1008.7 (1008) | 1000.8 (1000) | 1000.6 (1000) | 1106.7 (1106) | 1000.2 (1000) | 1290.0 (1290)
D1 (Layer 1 & 2) | 3201.0 (3201) | 2901.0 (2901) | 2801.0 (2801) | 3000.0 (3000) | 3201.0 (3201) | 3000.8 (3000) | 2631.0 (2631)
 | 3010.0 (3010) | 2890.7 (2890) | 3010.0 (3010) | 3000.9 (3000) | 3010.0 (3010) | 3000.7 (3000) | 2590.2 (2590)
D2 (Layer 1 & 2 & 3 & 4 & 5) | 1201.8 (1201) | 1018.2 (1018) | 1023.0 (1023) | 1000.8 (1000) | 1023.6 (1023) | 1000.1 (1000) | 1201.0 (1201)
 | 820.8 (820) | 972.9 (972) | 838.2 (838) | 500.8 (500) | 689.8 (689) | 520.8 (520) | 623.9 (623)
 | 880.0 (880) | 1327.0 (1327) | 708.0 (708) | 250.9 (250) | 230.8 (230) | 200.8 (200) | 432.9 (432)
 | 700.8 (700) | 800.3 (800) | 500.9 (500) | 250.8 (250) | 300.9 (300) | 280.9 (280) | 320.9 (320)
 | 1023.9 (1023) | 680.0 (680) | 300.8 (300) | 250.0 (250) | 276.8 (276) | 230.4 (230) | 302.8 (302)
DVI | 1.60 | 1.78 | 2.07 | 2.35 | 2.32 | 2.30 | 2.01
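The DVI rows in the tables above report the Dunn validity index of the clustering obtained with each hyperparameter configuration (higher values indicate compact, well-separated clusters). Assuming the standard definition (minimum inter-cluster distance divided by maximum intra-cluster diameter), a minimal pure-Python sketch:

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dunn_index(clusters):
    """Dunn validity index for a list of clusters, each a list of
    points (tuples of floats). Requires at least one cluster with
    two or more points so the maximum diameter is nonzero."""
    # Smallest distance between points of different clusters.
    min_inter = min(
        euclidean(p, q)
        for ci, cj in combinations(clusters, 2)
        for p in ci for q in cj
    )
    # Largest diameter (maximum pairwise distance) within any cluster.
    max_intra = max(
        max((euclidean(p, q) for p, q in combinations(c, 2)), default=0.0)
        for c in clusters
    )
    return min_inter / max_intra

# Two tight, well-separated clusters give a large index value.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
```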
References
- Bešinović, N. Integrated Capacity Assessment and Timetabling Models for Dense Railway Networks. Ph.D. Dissertation, Delft University of Technology, Delft, The Netherlands, 2017. [Google Scholar]
- Zheng, H.; Cui, Z.; Zhang, X. Identifying modes of driving railway trains from gps trajectory data: An ensemble classifier-based approach. ISPRS Int. J. Geo-Inf. 2018, 7, 308. [Google Scholar] [CrossRef]
- Bešinović, N.; Goverde, R.M.P. Capacity Assessment in Railway Networks; Springer: Berlin, Germany, 2018. [Google Scholar]
- Besinovic, N.; Roberti, R.; Quaglietta, E.; Cacchiani, V.; Toth, P.; Goverde, R.M.P. Micro-macro approach to robust timetabling. In Proceedings of the International Seminar on Railway Operations Modelling and Analysis, Tokyo, Japan, 23–26 March 2015. [Google Scholar]
- Goverde, R.M.P. A delay propagation algorithm for large-scale railway traffic networks. Transp. Res. Part C Emerg. Technol. 2010, 18, 269–287. [Google Scholar] [CrossRef]
- Longo, G.; Medeossi, G.; Nash, A. Estimating train motion using detailed sensor data. In Proceedings of the Transportation Research Board 91st Annual Meeting, Washington, DC, USA, 22–26 January 2012; pp. 1–6. [Google Scholar]
- Medeossi, G.; Longo, G.; Fabris, S.D. A method for using stochastic blocking times to improve timetable planning. J. Rail Transp. Plan. Manag. 2011, 1, 1–13. [Google Scholar] [CrossRef]
- Bešinović, N.; Goverde, R.M.P.; Quaglietta, E.; Roberti, R. An integrated micro–macro approach to robust railway timetabling. Transp. Res. Part B Methodol. 2016, 87, 14–32. [Google Scholar] [CrossRef]
- Zhou, L.; Tong, L.; Chen, J.; Tang, J.; Zhou, X. Joint optimization of high-speed train timetables and speed profiles: A unified modeling approach using space-time-speed grid networks. Transp. Res. Part B Methodol. 2017, 97, 157–181. [Google Scholar] [CrossRef] [Green Version]
- Cacchiani, V.; Toth, P. Nominal and robust train timetabling problems. Eur. J. Oper. Res. 2012, 219, 727–737. [Google Scholar] [CrossRef]
- Goerigk, M.; Schöbel, A. Recovery-to-optimality: A new two-stage approach to robustness with an application to aperiodic timetabling. Comput. Oper. Res. 2014, 52, 1–15. [Google Scholar] [CrossRef]
- Chen, J.; Zhang, X.; Cai, H.; Zheng, Y. A monitoring data mining based approach to measuring and correcting timetable parameters. Procedia-Soc. Behav. Sci. 2012, 43, 644–652. [Google Scholar] [CrossRef]
- Wang, J.; Zhang, X.; Yi, Z.; Chen, J. Method for the measurement and correction of train diagram parameters based on monitoring data mining. China Railw. Sci. 2011, 32, 117–121. [Google Scholar]
- Xiao, Z.; Wang, Y.; Fu, K.; Wu, F. Identifying different transportation modes from trajectory data using tree-based ensemble classifiers. ISPRS Int. J. Geo-Inf. 2017, 6, 57. [Google Scholar] [CrossRef]
- Fabris, S.D.; Longo, G.; Medeossi, G. Automated Analysis of Train Event Recorder Data to Improve Micro-Simulation Models. In Proceedings of the COMPRAIL 2010 Conference, Beijing, China, 31 August–2 September 2010; pp. 575–583. [Google Scholar]
- Powell, J.P.; Palacín, R. Driving style for ertms level 2 and conventional lineside signalling: An exploratory study. Ing. Ferrov. 2016, 71, 927–942. [Google Scholar]
- Goverde, R.M.P.; Daamen, W.; Hansen, I.A. Automatic identification of route conflict occurrences and their consequences. Comput. Railw. XI 2008, 103, 473–482. [Google Scholar] [Green Version]
- Albrecht, T.; Goverde, R.M.P.; Weeda, V.A.; Luipen, J.V. Reconstruction of Train Trajectories from Track Occupation Data to Determine the Effects of a Driver Information System. In Proceedings of the COMPRAIL 2006 Conference, Prague, Czech Republic, 31 August–2 September 2006; pp. 207–216. [Google Scholar]
- Dodge, S.; Weibel, R.; Forootan, E. Revealing the physics of movement: Comparing the similarity of movement characteristics of different types of moving objects. Comput. Environ. Urban Syst. 2009, 33, 419–434. [Google Scholar] [CrossRef] [Green Version]
- Elhoushi, M.; Georgy, J.; Noureldin, A.; Korenberg, M. Online motion mode recognition for portable navigation using low-cost sensors. Navigation 2016, 62, 273–290. [Google Scholar] [CrossRef]
- Schuessler, N.; Axhausen, K.W. Processing gps raw data without additional information. Transp. Res. Rec. 2009, 2105, 28–36. [Google Scholar] [CrossRef]
- Zheng, Y.; Liu, L.; Wang, L.; Xie, X. Learning transportation mode from raw gps data for geographic applications on the web. In Proceedings of the International Conference on World Wide Web, WWW 2008, Beijing, China, 21–25 April 2008; pp. 247–256. [Google Scholar]
- Wagner, D.P. Lexington Area Travel Data Collection Test: Gps for Personal Travel Surveys; Battelle Transportation Division: Columbus, OH, USA, 1997; pp. 1–92. [Google Scholar]
- Yalamanchili, L.; Pendyala, R.; Prabaharan, N.; Chakravarthy, P. Analysis of global positioning system-based data collection methods for capturing multistop trip-chaining behavior. Transp. Res. Rec. J. Transp. Res. Board 1999, 1660, 58–65. [Google Scholar] [CrossRef]
- Draijer, G.; Kalfs, N.; Perdok, J. Global positioning system as data collection method for travel research. Transp. Res. Rec. 2000, 1719, 147–153. [Google Scholar] [CrossRef]
- Wolf, J.L. Using Gps Data Loggers to Replace Travel Diaries in the Collection of Travel Data. Ph.D. Dissertation, Georgia Institute of Technology, School of Civil and Environmental Engineering, Atlanta, GA, USA, 2000; pp. 58–65. [Google Scholar]
- Reddy, S.; Min, M.; Burke, J.; Estrin, D.; Hansen, M.; Srivastava, M. Using mobile phones to determine transportation modes. ACM Trans. Sens. Netw. 2010, 6, 1–27. [Google Scholar] [CrossRef]
- Stenneth, L.; Wolfson, O.; Yu, P.S.; Xu, B. Transportation mode detection using mobile phones and gis information. In Proceedings of the ACM Sigspatial International Symposium on Advances in Geographic Information Systems, Acm-Gis 2011, Chicago, IL, USA, 1–4 November 2011; pp. 54–63. [Google Scholar]
- Widhalm, P.; Nitsche, P.; Brändle, N. Transport mode detection with realistic smartphone sensor data. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012; pp. 573–576. [Google Scholar]
- Das, R.D.; Winter, S. Detecting urban transport modes using a hybrid knowledge driven framework from gps trajectory. Int. J. Geo-Inf. 2016, 5, 207. [Google Scholar] [CrossRef]
- Mardia, K.V.; Jupp, P.E. Directional Statistics; Wiley: Hoboken, NJ, USA, 2000. [Google Scholar]
- Zheng, Y.; Li, Q.; Chen, Y.; Xie, X.; Ma, W.Y. Understanding mobility based on gps data. In Proceedings of the International Conference on Ubiquitous Computing, Seoul, Korea, 21–24 September 2008; pp. 312–321. [Google Scholar]
- Deng, H.; Runger, G.; Tuv, E.; Vladimir, M. A time series forest for classification and feature extraction. Inf. Sci. 2013, 239, 142–153. [Google Scholar] [CrossRef] [Green Version]
- Zhang, J.; Wang, Y.; Zhao, W. An improved hybrid method for enhanced road feature selection in map generalization. Int. J. Geo-Inf. 2017, 6, 196. [Google Scholar] [CrossRef]
- Qian, H.; Lu, Y. Simplifying gps trajectory data with enhanced spatial-temporal constraints. Int. J. Geo-Inf. 2017, 6, 329. [Google Scholar] [CrossRef]
- Ma, C.; Zhang, Y.; Wang, A.; Wang, Y.; Chen, G. Traffic command gesture recognition for virtual urban scenes based on a spatiotemporal convolution neural network. ISPRS Int. J. Geo-Inf. 2018, 7, 37. [Google Scholar] [CrossRef]
- Gonzalez, P.A.; Weinstein, J.S.; Barbeau, S.J.; Labrador, M.A.; Winters, P.L.; Georggi, N.L.; Perez, R. Automating mode detection using neural networks and assisted gps data collected using gps-enabled mobile phones. In Proceedings of the 15th World Congress on Intelligent Transport Systems and ITS America’s 2008 Annual Meeting, New York, NY, USA, 16–20 November 2008. [Google Scholar]
- van der Maaten, L. Learning a parametric embedding by preserving local structure. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater Beach, FL, USA, 16–18 April 2009; pp. 384–391. [Google Scholar]
- Dabiri, Z.; Lang, S. Comparison of independent component analysis, principal component analysis, and minimum noise fraction transformation for tree species classification using apex hyperspectral imagery. ISPRS Int. J. Geo-Inf. 2018, 7, 488. [Google Scholar] [CrossRef]
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
- Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 478–487. [Google Scholar]
- Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv, 2015; arXiv:1511.05644v2. [Google Scholar]
- Mukherjee, S.; Asnani, H.; Lin, E.; Kannan, S. Clustergan: Latent space clustering in generative adversarial networks. arXiv, 2018; arXiv:1809.03627v2. [Google Scholar]
- Springenberg, J.T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv, 2015; arXiv:1511.06390v2. [Google Scholar]
- Lee, J.G.; Han, J.; Li, X. Trajectory outlier detection: A partition-and-detect framework. In Proceedings of the IEEE International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008. [Google Scholar]
- Goldin, D.Q.; Kanellakis, P.C. On Similarity Queries for Time-Series Data: Constraint Specification and Implementation; Springer: Berlin, Germany, 1995. [Google Scholar]
- Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 2546–2554. [Google Scholar]
- Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Proceedings of the Learning and Intelligent Optimization—International Conference, Lion 5, Rome, Italy, 17–21 January 2011; pp. 507–523. [Google Scholar]
- Thornton, C.; Hutter, F.; Hoos, H.H.; Leytonbrown, K. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, Chicago, IL, USA, 11–14 August 2013; pp. 847–855. [Google Scholar]
- Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. Adv. Neural Inf. Proc. Syst. 2012, 4, 2951–2959. [Google Scholar]
- Goodfellow, I.J. Nips 2016 tutorial: Generative adversarial networks. arXiv, 2016; arXiv:1701.00160v4. [Google Scholar]
- Yang, Y.; Xu, D.; Nie, F.; Yan, S.; Zhuang, Y. Image clustering using local discriminant models and global integration. IEEE Trans. Image Process. 2010, 19, 2761–2773. [Google Scholar]
- Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.Y.; Foufou, S.; Bouras, A. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. Emerg. Top. Comput. IEEE Trans. 2014, 2, 267–279. [Google Scholar] [CrossRef]
- Li, Y.; Yu, F. A new validity function for fuzzy clustering. In Proceedings of the International Conference on Computational Intelligence and Natural Computing, Wuhan, China, 6–7 June 2009; pp. 462–465. [Google Scholar]
- Zhang, K.; Sun, D.; Shen, S.; Zhu, Y. Analyzing spatiotemporal congestion pattern on urban roads based on taxi gps data. J. Transp. Land Use 2017, 10, 675–694. [Google Scholar] [CrossRef]
- Nie, F.; Zeng, Z.; Tsang, I.W.; Xu, D.; Zhang, C. Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering. IEEE Trans. Neural Netw. 2011, 22, 1796–1808. [Google Scholar]
- Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-Means Clustering Algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 1979, 28, 100–108. [Google Scholar]
Index | Sub-Networks | Structure 1 |
---|---|---|
1 | Encoder (Generator) | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (F 2 + C 4, Linear + Softmax) 5
2 | Decoder | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (D 3, Linear) |
3 | Discriminator 1 | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (1, Sigmoid) |
4 | Discriminator 2 | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (1, Sigmoid) |
Index | Learning Type | Learning Sub-Networks | Iterations | Algorithm |
---|---|---|---|---|
I | Layer-wise | Encoder-Decoder | I: 2000 | SGD 1 |
II.1 | Layer-wise | Discriminator 1 | SGD 2 | |
II.2 | Layer-wise | Discriminator 2 | II: 2000 | SGD 2 |
II.3 | End-to-end | Generator | SGD 1 |
Index | Dropout Rate | Mini-Batch Size | Learning Rate | Convergence Threshold | Loss Function |
---|---|---|---|---|---|
I | 0.2 5 | 256 | 0.01 1 | —— | MSE 3 |
II.1 | 0 | 256 | 0.1 2 | —— | BCE 4 |
II.2 | 0 | 256 | 0.1 2 | —— | BCE 4 |
II.3 | 0.25 | 256 | 0.1 2 | —— | MSE 3 |
Index | Sub-Networks | Structure 1 |
---|---|---|
1 | Encoder | (1000, ReLU) − (1000, ReLU) − (2000, ReLU) − (F 2, Linear) |
2 | Decoder | (2000, ReLU) − (1000, ReLU) − (1000, ReLU) − (D 3, Linear) |
3 | Cluster Layer | (C 4, KL Divergence) |
Index | Learning Type | Learning Sub-Networks | Iterations | Algorithm |
---|---|---|---|---|
I.1 | Layer-wise | Encoder-Decoder | I: 2000 | SGD |
I.2 | End-to-end | Encoder-Decoder | SGD | |
II | End-to-end | Encoder-Cluster Layer | II: 50,000 | SGD 1 |
Index | Dropout Rate | Mini-Batch Size | Learning Rate | Convergence Threshold | Loss Function |
---|---|---|---|---|---|
I.1 | 0.2 | 256 | 0.1 1 | —— | MSE 2 |
I.2 | 0 | 256 | 0.1 1 | —— | MSE |
II | 0 | 256 | 0.01 | 0.10% | KLD 3 |
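Phase II above trains the encoder jointly with the cluster layer by minimizing a KL divergence. Following the standard DEC formulation of Xie et al. (cited in the references), the cluster layer computes Student's-t soft assignments q against cluster centres, and the loss is KL(p || q) against a sharpened target p. A sketch of these two quantities (variable names are ours):

```python
import numpy as np

def soft_assignment(z, mu, alpha=1.0):
    """Student's t similarity between embedded points z (n, d) and
    cluster centres mu (k, d); each row of q sums to 1."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target p; training minimizes KL(p || q)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

# Toy embedding: two points near centre 0, one near centre 1.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
q = soft_assignment(z, mu)
p = target_distribution(q)
```

The convergence threshold of 0.10% in the table refers to the fraction of points whose hard assignment (argmax of q) changes between updates.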
Index | Sub-Networks | Structure 1 |
---|---|---|
1 | Encoder (Generator) | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (F 2, Sigmoid) |
2 | Decoder | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (D 3, Linear) |
3 | Discriminator | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (1, Sigmoid) |
4 | Cluster Layer | (C 4, KL Divergence) |
Index | Learning Type | Learning Sub-Networks | Iterations | Algorithm |
---|---|---|---|---|
I | Layer-wise | Encoder-Decoder | I: 2000 II: 2000 III: 50,000 | SGD 1 |
II.1 | Layer-wise | Discriminator 3 | SGD 2 | |
II.2 | End-to-end | Generator 3 | SGD 1 | |
III | End-to-end | Encoder-Cluster Layer | SGD 1 |
Index | Dropout Rate | Mini-Batch Size | Learning Rate | Convergence Threshold | Loss Function |
---|---|---|---|---|---|
I | 0.2 6 | 256 | 0.01 1 | —— | MSE 3 |
II.1 | 0 | 256 | 0.1 2 | —— | BCE 4 |
II.2 | 0.2 6 | 256 | 0.1 2 | —— | MSE 3 |
III | 0 | 256 | 0.01 | 0.10% | KLD 5 |
Index | Sub-Networks | Structure ¹
---|---|---
1 | Encoder | (1000, ReLU) − (1000, ReLU) − (2000, ReLU) − (F ², Sigmoid)
2 | Decoder | (2000, ReLU) − (1000, ReLU) − (1000, ReLU) − (D ³, Linear)
3 | Generator | (500, Leaky ReLU) − (1000, Leaky ReLU) − (F ², Sigmoid)
4 | Discriminator | (1000, Leaky ReLU) − (500, Leaky ReLU) − (250, Leaky ReLU) − (250, Leaky ReLU) − (250, Leaky ReLU) − (C ⁴, Softmax)
Index | Learning Type | Learning Sub-Networks | Iterations | Algorithm
---|---|---|---|---
I | Layer-wise | Encoder-Decoder | I: 2000 | SGD ¹
II.1 | Layer-wise | Generator ² | II: 5000 | ADAM
II.2 | Layer-wise | Discriminator ² | | ADAM
Index | Dropout Rate | Mini-Batch Size | Learning Rate | Convergence Threshold | Loss Function
---|---|---|---|---|---
I | 0.2 | 256 | 0.01 | —— | MSE ¹
II.1 | 0 | 256 | 0.01 | —— | SE ²
II.2 | 0 | 256 | 0.01 | —— | SE ²
Index | Sub-Networks | Structure ¹
---|---|---
1 | Encoder (Generator 1) | (1000, ReLU) − (1000, ReLU) − (2000, ReLU) − (F ², Sigmoid)
2 | Decoder | (2000, ReLU) − (1000, ReLU) − (1000, ReLU) − (D ³, Linear)
3 | Generator 2 | (500, Leaky ReLU) − (1000, Leaky ReLU) − (F ², Sigmoid)
4 | Discriminator 1 | (3000, Leaky ReLU) − (3000, Leaky ReLU) − (1, Sigmoid)
5 | Discriminator 2 | (1000, Leaky ReLU) − (500, Leaky ReLU) − (250, Leaky ReLU) − (250, Leaky ReLU) − (250, Leaky ReLU) − (C ⁴, Softmax)
Index | Learning Type | Learning Sub-Networks | Iterations | Algorithm
---|---|---|---|---
I | Layer-wise | Encoder-Decoder | I: 2000 | SGD ¹
II.1 | Layer-wise | Discriminator 1 ³ | II: 5000 | SGD ²
II.2 | End-to-end | Generator 1 ³ | | SGD ¹
III.1 | Layer-wise | Generator 2 ³ | | ADAM
III.2 | Layer-wise | Discriminator 2 ³ | | ADAM
Index | Dropout Rate | Mini-Batch Size | Learning Rate | Convergence Threshold | Loss Function
---|---|---|---|---|---
I.1 | 0.2 ⁵ | 256 | 0.01 ¹ | —— | MSE ³
I.2 | 0 | 256 | 0.1 ² | —— | BCE ⁴
I.3 | 0.2 ⁵ | 256 | 0.1 ² | —— | MSE ³
II.1 | 0 | 256 | 0.01 | —— | SE ⁶
II.2 | 0 | 256 | 0.01 | —— | SE ⁶
ID | Index Name | Notation | D | C | EC | References
---|---|---|---|---|---|---
1 | Separation | | 1 | 0 | ↑ | [53]
2 | Dunn Validity Index | | 1 | 1 | ↑ | [53]
3 | L Value | | 1 | 1 | ↑ | [54]
4 | Minimum Centroid Distance | | 1 | 0 | — | [55]
5 | Compactness | | 0 | 1 | ↓ | [53]
6 | Davies-Bouldin Index | | 1 | 1 | ↓ | [53]
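Of the internal indices listed, the Davies-Bouldin index (lower is better, per the ↓ in the EC column) is the one reported in the result tables. A self-contained numpy sketch, assuming Euclidean distances:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average, over clusters, of the worst-case
    ratio of within-cluster scatter to between-centroid separation."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # s_i: mean distance of cluster i's points to its own centroid
    s = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, cents)])
    dbi = 0.0
    for i in range(len(ks)):
        r = [(s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])
             for j in range(len(ks)) if j != i]
        dbi += max(r)
    return dbi / len(ks)
```

Because the index needs no ground-truth labels, it suits the unsupervised setting here; scikit-learn's `davies_bouldin_score` provides an equivalent reference implementation.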
ID | Index Name | Notation | References
---|---|---|---
1 | Cluster Accuracy | | [53]
2 | Adjusted Rand Index | |
3 | Normalized Mutual Information | |
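Cluster Accuracy requires mapping the discovered cluster labels onto the ground-truth classes before scoring. A sketch using the Hungarian algorithm via SciPy (ARI and NMI have standard implementations, e.g. scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score`):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    """Best-match accuracy: find the cluster-to-class mapping that
    maximizes agreement, then score as ordinary accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # contingency matrix: w[i, j] = #points in cluster i with true class j
    w = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    rows, cols = linear_sum_assignment(-w)  # maximize total agreement
    return w[rows, cols].sum() / len(y_true)
```

The matching step matters because cluster IDs are arbitrary: a clustering that perfectly recovers the classes under a permuted labeling should still score 1.0.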
Index | Deep Models | Dataset ¹ | Cluster Number
---|---|---|---
1.1 | AAEC | RD9 |
1.2 | DEC | |
1.3 | AAEKC | |
1.4 | SAECC | |
1.5 | AAECC | |
Index | Feature Extractor | Cluster Model | Dataset ¹ | Cluster Number
---|---|---|---|---
2.1 | Railway-specified features [2] | Spectral Embedded Clustering [56] | AUBSD, AUBLD, ABSD, ABLD, RD9 | Determined by the method in Section 2.3.3
2.2 | Railway-specified features [2] | k-means [57] | |
2.3 | AAEC | | |
2.4 | DEC | | |
2.5 | AAEKC | | |
2.6 | SAECC | | |
2.7 | AAECC | | |
Index | Deep Models | Dataset ¹ | Cluster Number
---|---|---|---
3.1 | The top-performing model | RD9, PRD9 | Determined by the method in Section 2.3.3
3.2 | The second-best-performing model | |
Index | Deep Models | Dataset ¹ | Cluster Number
---|---|---|---
4.1 | AAEC | RD2, PRD2 | Determined by the method in Section 2.3.3
4.2 | DEC | |
4.3 | AAEKC | |
4.4 | SAECC | |
4.5 | AAECC | |
DBI | AAEC | DEC | AAEKC | SAECC | AAECC | k-Means | SEC
---|---|---|---|---|---|---|---
AUBLD | 7.95 | 8.36 | 8.60 | 8.61 | 8.99 | 14.08 | 11.27
ABLD | 9.28 | 8.36 | 8.59 | 8.94 | 8.11 | 12.15 | 11.25
RD9 | 8.58 | 9.08 | 8.24 | 9.06 | 8.51 | 12.81 | 10.05

Average DBI over the five deep models versus the two conventional methods (k-means, SEC), with the resulting performance improvement:

Dataset | Deep-Model Average | Conventional Average | Performance Improvement
---|---|---|---
AUBLD | 8.50 | 12.68 | 32.94%
ABLD | 8.66 | 11.70 | 26.01%
RD9 | 8.69 | 11.43 | 23.95%
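The average and improvement rows can be reproduced from the per-model DBI values, assuming the averages are taken over the five deep models and over k-means/SEC respectively (the recomputed figures match the table to rounding):

```python
import numpy as np

# DBI per dataset: five deep models, then the two conventional methods
# (k-means, SEC), copied from the table above
dbi = {
    "AUBLD": ([7.95, 8.36, 8.60, 8.61, 8.99], [14.08, 11.27]),
    "ABLD":  ([9.28, 8.36, 8.59, 8.94, 8.11], [12.15, 11.25]),
    "RD9":   ([8.58, 9.08, 8.24, 9.06, 8.51], [12.81, 10.05]),
}

for name, (deep, conv) in dbi.items():
    d, c = np.mean(deep), np.mean(conv)
    improvement = (c - d) / c  # lower DBI is better, so improvement is relative drop
    print(f"{name}: deep {d:.2f}, conventional {c:.2f}, improvement {improvement:.2%}")
```

Small discrepancies in the last decimal of the improvement column are rounding effects, depending on whether the ratio is taken before or after rounding the averages.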
Index | Mode | Phases (with Location Range km-km) ¹ | Speed-Location Profiles ²
---|---|---|---
1 | mode (c) | A(42,43)-Cr(43,43.5)-A(43.5,45)-Cr(45,49.8)-Co(49.8,52)-Cr(52,53.8)-B(53.8,54) |
2 | mode (e) | A(42,45)-Cr(45,51)-B(51,52)-Cr(52,53.8)-B(53.8,54) |
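The phase notation above encodes each mode as a sequence of (phase code, start km, end km) segments; reading A as an acceleration phase is supported by the paper's discussion, while the expansions of Cr, Co, and B are our assumption since the defining footnote is not reproduced here. A small parser sketch for the notation:

```python
import re

# one segment: a phase code followed by "(start,end)" in km
PHASE_RE = re.compile(r"([A-Za-z]+)\(([\d.]+),([\d.]+)\)")

def parse_phases(s):
    """Split a phase string like 'A(42,43)-Cr(43,43.5)' into
    (phase_code, start_km, end_km) triples."""
    return [(code, float(a), float(b)) for code, a, b in PHASE_RE.findall(s)]
```

A useful sanity check on any parsed mode is that consecutive segments are contiguous: each phase starts where the previous one ends.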
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Zheng, H.; Cui, Z.; Zhang, X. Automatic Discovery of Railway Train Driving Modes Using Unsupervised Deep Learning. ISPRS Int. J. Geo-Inf. 2019, 8, 294. https://doi.org/10.3390/ijgi8070294