This section outlines our methodology for assessing the impact of trackwork on train delays in Sweden using two regression analyses: multiple logistic regression and negative binomial regression. The section begins by presenting the datasets obtained from the Swedish Transport Administration, which cover the Swedish railway network in 2017 [33]. The data preparation process involves combining and structuring these data to make them suitable for regression analysis. Following this, we describe our use of multiple logistic regression to analyse the probability of train delays in relation to trackwork and other factors. Then, we explain the application of negative binomial regression to examine the frequency of these delays. Both methods were chosen because they can handle the complex nature of our dataset and are relevant to railway operations analysis.
2.1. Overview of Data
The first dataset comprises the trackwork records from the track utilisation plan, detailing 225,507 instances of scheduled trackwork. Each record provides specific information about the scheduled time, location, and the restrictions imposed on train traffic due to maintenance. Our study focused on basic maintenance trackwork, which is characterised by the absence of full track closures and a duration of less than 24 hours.
In the track utilisation plan, trackwork locations are identified by unique signal numbers situated along track segments that span two designated stations, marked as Ss and Se in Figure 1. From the 225,507 trackwork activities listed for 2017, we identified 3218 distinct track segments, each of which may include up to nine intermediary stations. Within these segments, the plan records sets of smaller trackwork activities that are performed at the same time in the same area. To streamline our dataset, we merged overlapping activities into single records, thereby eliminating duplication and simplifying the dataset for analysis. As a result, adjacent trackwork events, such as those depicted in Figure 1 as Ss.1–Sn.1 and Sn.2–Se.2, were combined into consolidated entries, labelled as trackwork 1–2 in the figure.
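The merging step can be illustrated with a minimal sketch, assuming each trackwork record on a segment reduces to a (start, end) time window. The function name `merge_overlapping` and the sample records are hypothetical, and the spatial matching of adjacent signal sections shown in Figure 1 is not reproduced here.

```python
from datetime import datetime
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge time windows that overlap or touch into consolidated windows."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous window: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical trackwork windows on one track segment.
records = [
    (datetime(2017, 3, 1, 9, 0), datetime(2017, 3, 1, 11, 0)),
    (datetime(2017, 3, 1, 10, 30), datetime(2017, 3, 1, 12, 0)),
    (datetime(2017, 3, 1, 14, 0), datetime(2017, 3, 1, 15, 0)),
]
merged = merge_overlapping(records)
```

In this example the first two windows collapse into one record and the third stays separate.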
The second dataset comprises the train punctuality data, extracted from the 2017 train plan. This dataset provides the scheduled and actual departure and arrival times at each station on the assigned train path, with a time precision of one minute. It includes specific details for each train route, such as a unique identification number, the type of train, and the type of track (single, double, or quadruple). In total, this dataset captures 32,591,482 train observations (Figure 2). Each recorded train passage is captured as a sequence of stations along its route, providing a more precise geographical profile than the trackwork dataset (Figure 1). To integrate the datasets, we matched each unique journey in the punctuality records with the corresponding track segments between the start (Ss) and end (Se) stations on the route. Because trains either traversed several segments or bypassed them entirely on their routes, the 32.6 million recorded journeys, matched against the 3218 designated segments, yielded roughly 27.2 million distinct train passages (Figure 2).
Following this, we prepared the datasets for analysis with two regression models: multiple logistic regression and negative binomial regression. For the logistic regression, we defined two additional variables to capture both the presence and absence of train running time (runtime) delay increases, without altering the overall number of observations (Figure 2). In contrast, for the negative binomial regression, we aggregated the data by each unique combination of train type, track type, trackwork, train entry status, day time, and location. We then added three new variables to the grouped dataset to quantify the counts of train running time delay increases, decreases, and instances where delays remained constant.
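The aggregation for the negative binomial model can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the field names in `GROUP_KEYS`, the `delay_min` field, and the helper names are hypothetical, and a change of at least 1 min in either direction is taken as the threshold for an increase or a decrease.

```python
from collections import defaultdict

# Hypothetical field names for the grouping variables listed in the text.
GROUP_KEYS = ("train_type", "track_type", "trackwork",
              "entry_status", "night", "segment")


def classify(delay_min, threshold=1):
    """Label a running time delay change as increase, decrease, or constant."""
    if delay_min >= threshold:
        return "increase"
    if delay_min <= -threshold:
        return "decrease"
    return "constant"


def aggregate(passages, threshold=1):
    """Count outcomes per unique combination of the grouping variables."""
    counts = defaultdict(lambda: {"increase": 0, "decrease": 0, "constant": 0})
    for p in passages:
        key = tuple(p[k] for k in GROUP_KEYS)
        counts[key][classify(p["delay_min"], threshold)] += 1
    return dict(counts)


# Hypothetical passages: three in one group, one in another.
base = {"train_type": "freight", "track_type": "single", "trackwork": 1,
        "entry_status": "late", "night": 0, "segment": "A-B"}
passages = [dict(base, delay_min=2), dict(base, delay_min=-1),
            dict(base, delay_min=0), dict(base, night=1, delay_min=5)]
grouped = aggregate(passages)
```

For the logistic model, the same `classify` label applied per passage would give the 0/1 indicators without changing the number of observations.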
Table 1 shows summary statistics for trackwork duration and train delay size. On average, trackwork lasted 181 min, but the durations varied widely, with a standard deviation of 207 min. The running time delay was calculated as the difference between the scheduled and actual train running times between the analysed stations, measured with a precision of 1 min. The observed running time delays have a mean of −0.15 min and a standard deviation of 5 min. The delays span a wide range, from a minimum of −444 min (the earliest arrival) to a maximum of 1447 min.
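As a minimal sketch of the delay calculation, the function below takes the running time delay as actual minus scheduled running time, so that positive values mean a delay increase and negative values a recovery. This sign convention is an assumption inferred from the negative minimum reported above, and the function name and timestamps are hypothetical.

```python
from datetime import datetime


def running_time_delay_min(sched_dep, sched_arr, act_dep, act_arr):
    """Actual minus scheduled running time between two stations, in minutes."""
    scheduled = sched_arr - sched_dep
    actual = act_arr - act_dep
    return round((actual - scheduled).total_seconds() / 60)


# Hypothetical passage: scheduled 20 min, actually took 22 min.
delay = running_time_delay_min(
    datetime(2017, 5, 2, 10, 0), datetime(2017, 5, 2, 10, 20),
    datetime(2017, 5, 2, 10, 3), datetime(2017, 5, 2, 10, 25),
)
```

Note that the train in this example departs 3 min late but its running time delay is only 2 min, because the measure concerns the change within the segment and not the delay at entry.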
The characteristics of the 27.2 million analysed train passages are presented in Table 2. The train passages were evenly distributed over the 12 months of 2017, with an average of 2.3 million train passages per month. Table 2 covers the following characteristics: train subtype, track type, running time delay, trackwork, train enter status, and day time. Each category of these variables is listed along with its percentage of observations, and the table reports delay-increase observations for four thresholds (1–4 min, 5–9 min, ≥10 min, and ≥1 min). Notably, among all categories, freight trains most frequently faced increases in running time delay. In contrast, commuter trains were less prone to such delay increases when passing the analysed section and instead predominantly experienced reductions in running time delay during the study period. Scheduled trackwork overlapped with about 0.4% of the train passages, whereas the remaining 99.6% did not pass through scheduled trackwork. Of the train passages, 10% were on quadruple track, 52% on double track, and 39% on single track. Our sample was composed of 81% passenger trains and 19% freight trains. In total, 29% of the train passages in our sample were ahead of schedule when entering the analysed track section, and 43% were behind schedule. Interestingly, trains that entered the section ahead of schedule often encountered a subsequent increase in running time delay. Finally, 86% of the passages occurred in the daytime and 14% at night. Night-time was defined (according to the labour act of Sweden [34]) as the period between 22:00 and 06:00. The total count of observations in the sample is 27,182,178.
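Because the night-time window wraps around midnight, a simple range check does not work. The sketch below shows one way to code the day time variable; the function name is hypothetical, and treating 22:00 as inclusive and 06:00 as exclusive is an assumption, since the text does not state how the boundaries were handled.

```python
from datetime import time


def is_night(t: time) -> bool:
    """True if t falls in the 22:00-06:00 window (22:00 inclusive, 06:00 exclusive)."""
    return t >= time(22, 0) or t < time(6, 0)


night_flag = int(is_night(time(23, 30)))  # coded 1 for night, 0 for day
```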
2.2. Regression Modelling
In this study, we analyse how train running time delay and delay recovery (i.e., delay decrease) are associated with trackwork. The control factors are train type and subtype (passenger or freight train, with subtypes of each) and train entry status (early, late, or on time) at the analysed track segment. Track type and day time are control variables for the trackwork relevant to this study’s context. We develop two types of regression models: (i) multiple logistic regression to explore the probability of train running time delay, and (ii) negative binomial regression to explore the frequency of train running time delay in the presence of scheduled trackwork. In addition to the main models, which consider running time delays of at least 1 min, we also performed a sensitivity analysis with higher running time delay thresholds of 5 and 10 min.
Table 3 provides a statistical summary of the response variables used in both the logistic and negative binomial regression models. For the logistic regression model, we consider running time delay increases and decreases of at least one minute, with the observations totalling 27,182,634. Within this model, the binary indicator for a delay increase of at least one minute has a mean of 0.22 and a standard deviation of 0.42. The indicator for a delay decrease at the same threshold has a mean of 0.45 and a standard deviation of 0.50, reflecting a higher frequency of delay decreases. The sensitivity of the model to more substantial delays is also examined, with thresholds at five and ten minutes; the lower means at these thresholds indicate fewer occurrences of longer delays.
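Since the logistic response variables are coded 0/1, their mean is the share of passages with the outcome, and the standard deviation follows directly from that share. The short check below uses only the rounded means quoted above; the symbol $p$ for the share is introduced here for illustration.

```latex
\mathrm{SD}(Y) = \sqrt{p\,(1-p)}, \qquad
\sqrt{0.22 \times 0.78} = \sqrt{0.1716} \approx 0.41, \qquad
\sqrt{0.45 \times 0.55} = \sqrt{0.2475} \approx 0.50
```

Both values agree with the reported standard deviations of 0.42 and 0.50 up to the rounding of the means.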
The negative binomial regression model is used for count data and was chosen because of the over-dispersion present in the delay counts. The variables for this model are counts aggregated by trackwork, track type, train subtype, train enter status, day time, and location (Figure 2), with a total of 406,563 observations. The response variable, the running time delay increase/decrease count, represents the number of running time delay increases/decreases among the train passages in each aggregated group on the studied track segment. The count of running time delay increases of at least one minute has a mean of 15 and a standard deviation of 44, indicating considerable variability in delay occurrences. For running time delay decreases of one minute or more, the mean count is 30, with a higher standard deviation of 100, suggesting a wider spread in the data. The sensitivity analysis for this model includes delay increases at five- and ten-minute thresholds, with 142 and 86 instances, respectively, reflecting a marked decline in counts as the delay duration increases.
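The over-dispersion can be checked roughly from the reported moments: a Poisson count would have a variance equal to its mean. The sketch below is illustrative only. `dispersion_summary` is a hypothetical helper, the variance form mu + mu^2/theta is the negative binomial parametrisation used by `glm.nb`, and the moment-based theta ignores all covariates, so it is not the fitted value.

```python
def dispersion_summary(mean, sd):
    """Return (variance, variance-to-mean ratio, moment estimate of theta)."""
    variance = sd ** 2
    ratio = variance / mean
    # Solve variance = mean + mean**2 / theta for theta (only if over-dispersed).
    theta = mean ** 2 / (variance - mean) if variance > mean else float("inf")
    return variance, ratio, theta


# Reported moments of the delay-increase count (at least 1 min).
variance, ratio, theta = dispersion_summary(15, 44)
```

With a mean of 15 and a standard deviation of 44, the variance is 1936, roughly 129 times the mean, which is far from the Poisson assumption.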
2.2.1. Multiple Logistic Regression
We use a multiple logistic regression model to analyse the effect of trackwork, along with other explanatory variables, on train running time delay increases (1) and decreases (2). Logistic regression is commonly used to study functional relationships between a categorical dependent variable and one or more independent variables [35,36]. The response variable for the first model captures the presence or absence of a train running time delay increase while passing an analysed track segment, coded as 1 and 0, respectively. In the second model, the response variable reports the presence or absence of a train delay decrease in the same circumstances, again coded as 1 and 0, respectively. The multiple regression model predicts the occurrence of a train running time delay increase/decrease ($Y$) from the explanatory variables ($X_1, \dots, X_5$) described in Table 2. The model is summarised by the equation:

$$\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5$$

where:
$Y$ is the response variable capturing the presence or absence of the train running time delay increase (1 min) for the first model and of running time delay decrease (1 min) for the second model, given the predictor variables. The possible values are 0 or 1;
$X_1, \dots, X_5$ are the predictor variables in the model (trackwork, track type, train subtype, train enter status, and day time, respectively);
$\beta_0$ is the intercept term, and $\beta_1, \dots, \beta_5$ are the coefficients for each predictor variable.
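To make the model concrete, the sketch below converts a linear predictor (an intercept plus one coefficient per predictor) into a predicted probability through the logistic function. All coefficient values are hypothetical and chosen only for illustration, and each predictor is treated as a single 0/1 indicator, whereas the multi-level categorical variables would in practice be dummy-coded.

```python
import math


def predict_probability(intercept, coefs, x):
    """Logistic model: P(Y=1) = 1 / (1 + exp(-(b0 + sum(b_k * x_k))))."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients in the order: trackwork, track type,
# train subtype, train enter status, day time.
intercept = -1.3
coefs = [0.4, 0.2, -0.1, 0.3, 0.1]

p_without = predict_probability(intercept, coefs, [0, 0, 0, 0, 0])
p_with = predict_probability(intercept, coefs, [1, 0, 0, 0, 0])
```

With a positive trackwork coefficient, `p_with` exceeds `p_without`; the size of the gap depends on the values of the other predictors.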
The explanatory variable trackwork is a binary variable that equals 1 when the train passage on the studied track segment overlaps with scheduled trackwork and 0 otherwise. Track type, train subtype, and train enter status are categorical explanatory variables representing the type of track, the subtype of train, and whether the train enters the segment on time, early, or late, respectively. The day time variable indicates whether the train passed the analysed line during the day (0) or at night (1). Pearson’s chi-squared test was used to check the independence of the qualitative variables entering the regression model, and the results show that all tested variables were independent. The variables included in this model were selected by testing several logistic models.
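For readers unfamiliar with the independence check, the following minimal sketch computes Pearson's chi-squared statistic for a contingency table of two categorical variables. The function name and the example tables are hypothetical, no continuity correction is applied, and the p-value lookup is omitted.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical 2x2 tables: the first shows association, the second none.
stat_assoc, dof = chi_squared_statistic([[10, 20], [20, 10]])
stat_indep, _ = chi_squared_statistic([[10, 20], [20, 40]])
```

A statistic close to zero, as in the second table, is what independence between two predictors looks like.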
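For ease of interpretation of the multiple logistic regression coefficients, we computed the odds ratio (OR). The OR is a measure of association between a given exposure in a logistic regression and an outcome $Y$:

$$\mathrm{OR} = e^{\beta_j}$$

where $\beta_j$ is the estimated logistic regression coefficient of the predictor (exposure) $j$.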
The OR, therefore, indicates how much more likely the event is to happen given a particular exposure (in this case, trackwork) compared to its absence. An OR greater than 1 suggests a higher likelihood of the event when the exposure is present, whereas an OR less than 1 indicates a reduced likelihood. This measure is particularly useful in logistic regression as it provides a clear and interpretable metric of the strength and direction of the association between predictors and the outcome variable.
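The two usual routes to an odds ratio can be sketched as follows: exponentiating a logistic regression coefficient, or dividing the odds in the exposed group by the odds in the unexposed group of a 2x2 table. The function names and counts are hypothetical and unrelated to the study data.

```python
import math


def odds_ratio_from_coef(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def odds_ratio_from_table(exposed_event, exposed_no_event,
                          unexposed_event, unexposed_no_event):
    """Odds of the event with exposure divided by odds without exposure."""
    odds_exposed = exposed_event / exposed_no_event
    odds_unexposed = unexposed_event / unexposed_no_event
    return odds_exposed / odds_unexposed


# Hypothetical counts: 30 of 100 exposed passages delayed, 20 of 100 unexposed.
or_table = odds_ratio_from_table(30, 70, 20, 80)
or_coef = odds_ratio_from_coef(math.log(or_table))
```

In a model with a single binary predictor the two routes coincide; with several predictors, the coefficient-based OR is adjusted for the other variables.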
2.2.2. Negative Binomial Regression
We employed two negative binomial regression models to analyse the relationship between the count of train running time delay increases (1) and decreases (2) and a set of explanatory variables. The regression coefficients were estimated using the glm.nb function in R (2023.06.2). The equation for the model is as follows:

$$\ln(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \ln(t)$$

where:
$\mu$ is the expected count of running time delay increase (1 min) for the first model and of running time delay decrease (1 min) for the second model given the predictor variables;
$X_1, \dots, X_5$ are the predictor variables in the model (trackwork, track type, train subtype, train enter status, and night, respectively);
$\beta_0$ is the intercept term, and $\beta_1, \dots, \beta_5$ are the coefficients for each predictor variable;
$\ln(t)$ is the natural logarithm of the exposure variable $t$ for the observation.
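The role of the exposure term can be seen in a small sketch: because its logarithm enters the linear predictor with a fixed coefficient of 1, the expected count scales proportionally with the exposure. The coefficients and exposure values below are hypothetical, and each predictor is again treated as a single 0/1 indicator for simplicity.

```python
import math


def expected_count(intercept, coefs, x, exposure):
    """Log-link count model with offset: mu = exposure * exp(b0 + sum(b_k * x_k))."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x)) + math.log(exposure)
    return math.exp(eta)


# Hypothetical coefficients in the order: trackwork, track type,
# train subtype, train enter status, night.
intercept = -1.5
coefs = [0.2, 0.1, -0.3, 0.4, 0.05]

mu_100 = expected_count(intercept, coefs, [1, 0, 0, 0, 0], exposure=100.0)
mu_200 = expected_count(intercept, coefs, [1, 0, 0, 0, 0], exposure=200.0)
```

Doubling the exposure doubles the expected count, which is why the offset lets groups of different sizes be compared on a rate basis.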
For ease of interpretation of the coefficients obtained from the negative binomial regression, we computed the incidence rate ratio (IRR) by taking the exponent of the estimated coefficients, which is expressed as $\mathrm{IRR} = e^{\beta_j}$, where $\beta_j$ is the estimated coefficient of predictor $j$. This allows us to directly interpret the proportional change in the count of running time delay increases or decreases associated with a one-unit change in the predictor variable, with all other variables held constant.
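As a worked illustration with a hypothetical coefficient (not an estimate from this study), a value of $\beta = 0.2$ for a binary predictor gives:

```latex
\mathrm{IRR} = e^{0.2} \approx 1.22, \qquad
(\mathrm{IRR} - 1) \times 100\% \approx 22\%
```

That is, the expected count would be about 22% higher when the predictor equals 1 than when it equals 0, with all other variables held constant. A negative coefficient gives an IRR below 1 and hence a lower expected count.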