Article

Investigating the Potential of Data Science Methods for Sustainable Public Transport

Institute for Ubiquitous Mobility Systems, Hochschule Karlsruhe University of Applied Sciences, 76133 Karlsruhe, Germany
*
Author to whom correspondence should be addressed.
Sustainability 2022, 14(7), 4211; https://doi.org/10.3390/su14074211
Submission received: 8 March 2022 / Revised: 28 March 2022 / Accepted: 30 March 2022 / Published: 1 April 2022

Abstract

The planning and implementation of public transport involves many data sources. These data sources in turn generate a high volume of data, in a wide variety of formats and data rates. This phenomenon is reinforced by the ongoing digitization of public transport; new data sources have continuously emerged in public transport in recent years and decades. This results in a great potential for the application and utilization of data science methods in public transport. Using big data methods and sources can, or in some cases already does, contribute to a better understanding and the further optimization of public transport networks, public transport service and public transport in general. This paper classifies data sources in the field of public transport and examines systematically for which use cases the data are used or can be used. These steps contribute by structuring ongoing discussions about the application of data science in the public transport domain and illustrate the potential of the application of data science for public transport. We present several use cases in which we applied data science methods, such as machine learning and visualization to public transport data. Several of these projects use data from automated passenger information systems, a data source that has not been widely studied to date. We report our findings for these use cases and discuss the lessons learned, to inform future research on these use cases and discuss their potential. This paper concludes with a summary of the typical problems that occur when dealing with big public transport data and a discussion of solutions for these problems. This discussion identifies future work and topics worth investigating for public transport companies as well as for researchers. 
Working on these topics will, in our opinion, support the improvement of public transport towards the efficiency and attractiveness that is needed for public transport to play its essential role in future sustainable mobility. The application of these methods in public transport requires the collaboration of domain experts with researchers and data scientists, calling for a mutual understanding. This paper also contributes to this understanding by providing an overview of the methods that are already used, potential new use cases, data sources, challenges and possible solutions.

1. Introduction

Public transport companies are confronted with major challenges. Great hopes are pinned on public transport, as a climate-friendly alternative to the private car, to help tackle the climate crisis by providing a more sustainable means of transport. Germany, for example, wants to double the number of public transport passengers by 2030 compared to 2010, according to its climate action plans. Achieving this goal will only be possible through large investments, increased attractiveness and optimized planning. While investments in public transport are the responsibility of the government at the federal and regional levels, increased attractiveness and optimized planning are goals that can also be supported by research and the development of new technologies. Optimized planning in particular requires accurate information about the demand, operation, and optimization possibilities of public passenger transport. The planning and operation of public transport already involves a multitude of data sources. The ongoing digitization of public transport makes new data sources accessible and has already greatly increased the amount of data available in recent years and decades. These data sources, in turn, generate a high volume of data in a wide variety of formats and data rates. However, many current challenges in public transport planning stem from a lack of coherent information, and transport companies are currently not able to exploit the full potential of their data. The application of data science methods to the vast amount of data in public transport could be the key needed to fill the information gap and to provide a foundation for the further expansion of public transport. In this paper, we present our systematic approach to recognize and tap into this potential. We present our approaches to utilize public transport data and discuss the lessons we learned.
We hope to demonstrate that the public transport domain offers numerous valuable use cases for data scientists to explore as well as to clarify the potential of data science methods for representatives of the public transport domain. We identify challenges for the application of data science on public transport data and propose solutions to highlight a path towards data management in public transport that enables the efficient and successful application of data science methods. We are firmly convinced that these developments can improve public transport and greatly contribute to developing public transport into an even more important pillar of sustainable mobility in the future.
First, we identify and categorize the data sources generally available to transport companies and review several approaches towards data collection and data analysis for public transport. Section 2 describes these categories and data sources, as well as related work for data science methods in these data source categories. Based on this classification, we then provide an overview of the use cases that transport companies can fulfil by analyzing these data. We interview representatives of public transport associations as well as representatives from several companies developing software for public transport about the use cases for which they envision utilizing their data and about their current problems that stem from a lack of coherent information. In Section 3, we discuss the use cases we identified as well as recent approaches to these use cases from the literature. The resulting classification can structure the discussion of how and where to apply data science methods in public transport, specifically discussions in the public transport domain itself. In our current and ongoing work, we use data science methods to explore the application of public transport data to some of the use cases. We report on the use cases that we explored and describe our findings in Section 4. In Section 5, we describe the difficulties we encountered and discuss how the characteristics of public transport and data sources in public transport entail challenges for data science and the application of machine learning methods. Based on these challenges, we discuss open research questions and future work as well as organizational challenges for public transport companies as solution approaches. Section 6 concludes this paper with a summary and outlook. The solution approaches discussed lay out a path for the improvement of public transport.

2. Categorization of Data Sources in Public Transport

To understand data sources in public transport and to identify which data sources are currently investigated and used, we interviewed representatives of public transport agencies and reviewed three meta-studies investigating those data sources. In their meta-study, Maria Karatsoli and Eftihia Nathanail looked at 69 studies from the field of transport research [1]. Of these studies, 14 deal with public transport issues. Khatun E. Zannat and Charisma F. Choudhury limited their analysis to studies that used big data in the field of public transport planning [2]. They examined a total of 47 publications.
Timothy F. Welch and Alyas Widita evaluated 81 publications [3]. They focused on data sources in the field of public transport.
To categorize the data sources, we used the viewpoint of public transport agencies and identified the systems that produce data as the main categories, as displayed in Figure 1 and listed below:
  • Automated Fare Control (AFC) Systems
  • Automated Passenger Count (APC) Systems
  • Vehicle Sensors and Systems
      ◦ Vehicle Sensors
          ▪ Automated Vehicle Location (AVL) Systems
          ▪ Vehicle Condition Sensors
      ◦ Vehicle Software Systems
  • User’s Mobile Phone
  • Social Media
  • Automated Passenger Information (API)
For some of these types of systems, we identified subcategories, such as the physical and logical category of Automated Vehicle Location Systems. Figure 1 also shows some examples of specific variants regarding the implementation details of these systems. These are also discussed in the following paragraphs.
An Automated Fare Control (AFC) system enables automated ticket sales, ticket validation, and inspection. There are different types of systems. In one form, the passenger actively logs in and out of the system at the beginning and end of the journey (check-in/check-out). In another form, the system automatically registers the start and end of the passenger’s journey, for example, via communication between a radio beacon and the user’s smartphone (be-in/be-out). Combinations of both forms can also be used (e.g., check-in/be-out). Based on the duration and the start as well as end point of the journey, the fare is determined automatically. An AFC system can be implemented using a wide variety of media. Currently, smartcards seem to be the most common medium. As we observed in all three meta-studies, most of the papers handling AFC data deal with the analysis of data collected by means of smartcards. Using AFC data, one can obtain a good impression of passengers’ actual movements in public transport. If a public transport association does not use an Automated Fare Control system, the actual movement of passengers, and therefore the real load on a public transport system, must be determined in another way.
An Automated Passenger Count (APC) system automatically records the number of passengers in order to determine the real load of a public transport system. Several types of APC exist. Turnstiles can be used to detect passengers boarding and alighting at stops. This method requires a fully fenced traffic system, which is rare. A rough estimate of the passenger volume can also be achieved by weighing the vehicles [4]. More accurate passenger counts can be accomplished using infrared or laser barriers on the inside of vehicle doors [5]. Video and depth cameras can also be used to count passengers automatically [6].
Public transport vehicles record a wide variety of data. These fall into the category Vehicle Sensors and Systems, which is split into the subcategories Vehicle Sensors and Vehicle Software Systems. Data from software systems in a vehicle include data from the on-board computer, the communication module or the passenger information displays, for example. Sometimes, sensor data are logged in vehicle software systems, such as the on-board computer. They can be logged as raw data or as already interpreted data and are often complemented with other data, such as the line and stop sequence the vehicle is serving.
Vehicle sensor data can be roughly categorized into two categories. Sensors are used to monitor the condition of the vehicle or to locate the vehicle. Examples for Vehicle Condition Sensors are sensors for the oil level, or for the condition of the oil filter [7].
Automated Vehicle Location (AVL) systems are used to determine the location of a vehicle during operation automatically. In public transport, a distinction is made between logical and physical location procedures [8]. Logical and physical positioning can also be used complementarily. Logical positioning takes advantage of the fact that public transport is usually organized as a regular service. This means that the position, starting from a defined starting point, can be recorded based on the distance traveled and can be logically deduced. The distance traveled can be determined, for example, by means of an odometer via wheel rotation. Physical positioning is possible, for example, via infrared markers at bus stops. These can be registered by vehicles, whereupon the position of the stop can be assigned to the vehicle. Probably the most common form of location determination today is using a satellite navigation system via GPS (Global Positioning System). Relevant for the usage of AVL data is the frequency in which vehicle positions are recorded as well as if and how the vehicle positions are reported in real time. Both properties may significantly vary for different public transport systems and sometimes even within one public transport system, for example, because of differences in software and/or hardware in vehicles of different modes of transport.
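The logical positioning described above can be illustrated with a minimal sketch. It assumes that the stop sequence of a line and the cumulative distance of each stop from the starting point are known, and that the odometer reports the distance traveled since that starting point. The function name, the stop names and the distances are hypothetical and not taken from any specific AVL system.

```python
from bisect import bisect_right

def locate_on_route(stop_names, stop_offsets_m, odometer_m):
    """Deduce the logical position of a vehicle on a fixed stop sequence.

    stop_offsets_m holds the cumulative distance (in metres) of each stop
    from the defined starting point, in ascending order (assumed input).
    Returns (previous_stop, next_stop, fraction_of_segment_covered).
    """
    if odometer_m <= stop_offsets_m[0]:
        return stop_names[0], stop_names[0], 0.0
    if odometer_m >= stop_offsets_m[-1]:
        return stop_names[-1], stop_names[-1], 0.0
    # Index of the last stop whose offset is not beyond the odometer reading
    i = bisect_right(stop_offsets_m, odometer_m) - 1
    segment_length = stop_offsets_m[i + 1] - stop_offsets_m[i]
    fraction = (odometer_m - stop_offsets_m[i]) / segment_length
    return stop_names[i], stop_names[i + 1], fraction

# Hypothetical line with three stops
stops = ["Stop A", "Stop B", "Stop C"]
offsets = [0.0, 400.0, 1000.0]
position = locate_on_route(stops, offsets, 700.0)
```

In practice, odometer drift accumulates along the route, which is one reason why logical and physical positioning are used complementarily: a physical fix at a stop can reset the distance counter.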
Another possible data source for public transport is the User’s Mobile Phone. Most public transport passengers use a mobile phone, which is why information about the passengers and their movement can be inferred from mobile phone data. Data about or from a mobile phone can be accessed using the cellular infrastructure, Wi-Fi or Bluetooth. Mobile phone data are, for example, recorded and can be provided by telecommunication companies. By using the triangulation of the measured received signal strength and the signal transmission time between a mobile phone and the base station, the location of a mobile phone device can be determined. The spatial resolution can be improved to a few meters if three or more base stations are within range [9]. Thus, for example, changes in the location of passengers can be accurately detected. This data source extends beyond the scope of public transport, but can also be analyzed specifically targeting public transport usage. However, the analysis of mobile phone data is often limited due to data protection and licensing problems [10]. Therefore, it is often difficult for a transport company to obtain these data from telecommunication companies. In this case, the transport operator has the option of using Wi-Fi or Bluetooth sensors as radio beacons. These sensors can detect the passenger’s approximate position by communicating with the passenger’s mobile phone device [10]. Similar to mobile phone data computed from signal strength, the accuracy can be improved by using multiple sensors in combination with triangulation techniques.
Social Media has also become a data source harvested for public transportation in recent years. For analysis and research, social media networks usually make their data available via application programming interfaces. However, there can be major differences in the extent to which the networks make their data available and at what cost. Since the various social networks all have a similar purpose but can differ greatly in their range of functions and interaction options, the data generated through them are correspondingly diverse. As a result, the data from the various social networks are differently suited for the application areas in public transport, as shown in the literature review of Nikolaidou and Papaioannou [11]. Their research also indicates that Twitter data are currently the most widely used social media data source in public transportation research.
In addition to the data sources discussed by the above-mentioned meta-studies, we introduced one additional possible source of data in public transportation: Automated Passenger Information (API). Automated passenger information systems allow passengers to plan their trip and to always be well informed about the public transport network and their planned or current journey. A good example of a system that generates API data is an electronic route planning system. An electronic route planning system enables travelers to retrieve journey options and information about journeys. For this purpose, the user specifies at least the origin, destination and the desired departure or arrival time of their intended journey. Based on this information, the system can then calculate possible routes and display them to the user. A route planning system does not necessarily have to be limited to one means of transport, but can also include multimodal information. The data that a route planning system stores can be divided into two datasets. The first dataset contains the requests of the users, to which we further refer as route requests. A route request contains, for example, the requested origin and destination as well as the desired departure or arrival time of the user. A more detailed list of data contained in a route request is given in Section 4. The second dataset contains the results calculated by the routing system, i.e., the possible connections, based on the user’s entries. We further refer to this dataset as route responses. Typically, several possible routes are found and calculated for each route request. Likewise, a possible route often consists of several legs that the passenger has to cover in order to reach their destination. Figure 2 shows the difference between route requests and route responses. In this case, one possible route is found that consists of two legs.
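The relation between route requests and route responses can be sketched as a small data model. This is only an illustration of the structure described above: all class names, field names, stop names and modes are assumptions and do not reproduce the schema of any actual route planning system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class RouteRequest:
    """What the user enters: origin, destination and a desired time."""
    origin: str
    destination: str
    departure_time: Optional[datetime] = None  # either departure ...
    arrival_time: Optional[datetime] = None    # ... or arrival is given

@dataclass
class Leg:
    """One section of a route covered with a single means of transport."""
    from_stop: str
    to_stop: str
    mode: str  # hypothetical label, e.g. "tram" or "bus"

@dataclass
class Route:
    """One possible connection, consisting of one or more legs."""
    legs: List[Leg] = field(default_factory=list)

@dataclass
class RouteResponse:
    """What the system computes: possible routes for one request."""
    request: RouteRequest
    routes: List[Route] = field(default_factory=list)

# Hypothetical example: one request, one route found, two legs
request = RouteRequest("Stop A", "Stop C",
                       departure_time=datetime(2019, 1, 1, 8, 0))
response = RouteResponse(
    request,
    [Route([Leg("Stop A", "Stop B", "tram"), Leg("Stop B", "Stop C", "bus")])],
)
```

Keeping requests and responses as separate records mirrors the two datasets described above: requests reflect what users ask for, while responses reflect what the network can offer.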

3. Use Cases and Related Work

The previous section showed that there is a large variety and quantity of digital data that are generated and can be harvested for data analysis in public transport. Similarly, there are numerous and various use cases that can benefit from a deeper analysis of public transport data. We systematically collected and structured these use cases to understand better the information need and the potential of big data analysis approaches in public transport. For our analysis, we conducted several workshops and discussions with public transport agencies and operators as well as with companies developing several types of software for public transport operation. We organized the emerging use cases in two dimensions and discuss them in this section.
Table 1 shows the result of our use case analysis in two dimensions: time and types of tasks. The timing of data analysis and usage greatly influences the methods that are applicable. The time dimension is organized as follows: data can be analyzed to immediately assess or manage the current situation. Use cases aiming at current events or the current situation require real-time capable methods and infrastructure. Other use cases utilize data analysis for the short-term future, to support decisions for the same or the next day or week. In this case, the analysis is not needed in hard real-time, but still can be time sensitive. Data can also be analyzed for the support of planning in the medium-term future, ranging from weeks or months to up to a year. In public transport, there are also use cases in long-term planning that affect decisions for several years in the future. Those use cases allow the application of time-intensive methods.
In addition to ordering use cases by time frame, the second dimension addresses the type of tasks that can be supported. Different application domains require different types of data. This dimension includes monitoring and planning the public transport network, all use cases concerning the timetable, applications for public transport vehicles, knowledge about public transport passengers and their behavior, and passenger information.
Managing and developing the public transport network are core tasks of public transport agencies and there are many use cases for data science for tasks concerning the public transport network. Analyzing public transport data using data science methods can support operators in determining and assessing the current situation in their network [12,13]. Big data methods can also be used to evaluate the network’s performance by monitoring and analyzing the demand and actual passenger count, delays, and disruptions [14]. Using historical data, patterns of broken connections can, for example, be analyzed to reveal flaws in the timetable and network plans [15]. Duty scheduling can also be optimized. Depending on available sensor and other relevant data on infrastructure elements and vehicles, the predictive maintenance of the infrastructure can improve the longevity of these elements [16,17,18,19].
Looking at the medium-term or long-term future, the planning of lines and overall network planning, including new stations and stops, can be informed by a careful analysis of long-term historical data and a prognosis of demand [20]. The support of planning also extends to multimodal planning, taking other modalities into account. Approaches analyzing bike sharing exist, for example [21]. By complementing such an analysis with public transport data on demand and usage, multimodal planning and multimodal passenger information can greatly benefit. Additionally, trip planning services can incorporate knowledge about sharing vehicles and their usage to support multimodal trip planning. On another level, infrastructure planning can also benefit from a deep analysis of sensor data, usage data and predictions of demand, for example.
Several types of analysis can benefit timetable management and service planning. The detection of delays and disruptions as well as a prediction of their effects in real-time can support rescheduling decisions and mitigating actions [22]. In some types of events, replacement services must be organized as soon as possible. A prediction of demand can provide guidance for implementing such replacement services [23]. Since disruptions and to some extent delays are unexpected events, handling those events can benefit from available real-time data [24]. These use cases also depend on real-time capable methods to react quickly to unexpected events. Predicting demand helps to manage on-demand services more closely and optimize resources, for example, minimizing the number of vehicles that are on call for on-demand transport.
The analysis of broken connections as mentioned above can inform the development of new timetables and a more detailed analysis of the public transport demand can be used to optimize the frequencies of lines. Examining the demand of passengers and their actual public transport usage can be used to plan on-demand services, especially when they are supposed to replace existing services that generate such data.
Malfunctions of public transport vehicles can be prevented using predictive maintenance, based on vehicle data, either in real-time or in short-time periods [25,26,27]. In the case of electric buses, machine learning analysis can be used to plan and optimize the charging of vehicles [28,29,30]. Vehicle capacity planning can also be supported by the analysis of data on demand and passenger count data.
The core of public transport is to transport passengers. Yet, very often, not much is known about these passengers. There are some data sources that can be used to gain more knowledge about passengers, their whereabouts, goals, and behavior. Considering real-time analysis, it is an important use case to predict passenger numbers in vehicles or at stops. Especially in times of the COVID-19 pandemic, passengers want to avoid vehicles that are too full. Independent of the pandemic, full vehicles are also a source of discomfort that passengers want to avoid. For operators, vehicles that are often too full imply that vehicle capacity should probably be re-planned. Predicting and estimating passenger numbers can be achieved using big data and machine learning [31,32]. Based on similar data, passenger flows can be re-directed, for example, in the case of big events, very full vehicles or, on a smaller scale, to optimize boarding and minimize the time vehicles spend at stops.
Mobility behavior can be modeled based on big data, too [33,34,35]. Specific models for public transport usage can support planning and evaluation, but also can be useful for planning replacement services in the case of disruptions or construction, for example. The analysis of passenger data can also be used to secure connections by analyzing the frequencies and popularity of connections or boarding times [36,37,38,39]. A prediction of public transport demand specifically can be useful for several usages, from the planning of infrastructure, stops and lines long term to the planning of on-demand services [12,13,40,41,42]. Lathia and Capra analyzed smartcard data to measure travel behaviors and enable transport operators to manage incentives for behavior change [43].
Finally, precise and timely passenger information is crucial for a good public transport experience and therefore for increasing the attractiveness of public transport. Big data can help to provide passenger information in real time, complementing trip information with information about vehicle occupancy, providing precise information about delays and their expected development as well as providing timely information about disruptions, including trip alternatives, for example. At the same time, data analysis can provide a basis for personalized passenger information, identifying mobility patterns of user groups and tailoring information to users’ mobility preferences, for example. Using knowledge about the user’s preferences, behavior and trips, critical information can be provided ahead of time.
This discussion of use cases for big data methods on public transport data demonstrates that these methods have the potential to improve public transport in a variety of applications.

4. Applying Data Science Methods to Public Transport Data

In our work, we explored use cases to optimize public transport using data science methods, specifically machine learning. Table 2 provides an outline of all projects together with the datasets and methods used.
In this section, we discuss several of our projects investigating the application of data science methods to public transport data and present our insights.

4.1. Project 1: Visual Analytics for Public Transport

Use Case: Data from Automated Passenger Information systems (API) have not been widely used for data analysis in public transport yet. However, first attempts at analyzing route requests to passenger information systems have shown that route requests correlate with real transport demand [44]. Passenger behavior has been analyzed using data from passenger information systems, for example, for extreme weather events [45,46]. Our goal is to explore the potential of API data further. To understand this potential and to develop a basis for discussing the data and their potential with domain experts, we first explored visualization and visual analytics. We were interested in how the visual analysis of route requests can support network analysis and the analysis of the public transport demand. Visual analytics have been used successfully for the analysis of public transport data before, for example, using data from public transport vehicles, in an approach by M. Wörner and T. Ertl [47]. We investigated whether insights from visualizing passenger requests can form a basis for further optimization and planning. Another goal was to determine which visualizations are suitable for this type of data, both for data scientists and for domain experts.
Data: As described above, we worked on data from automated passenger information systems. In this case, a dataset of route requests received by the KVV (Karlsruher Verkehrsverbund, the transport association in the Karlsruhe region, Germany) was made available. The dataset covered the period from 1 January 2019 to 10 October 2019 and contained over 18 million requests. This is a good example of the scale of the data that are produced in the operation of public transport. Each request contained the information shown in Table 3.
Methods: We decided to develop an interactive visualization dashboard that allowed the flexible configuration of visualizations. Because the data volume can be large, our goal was to allow the dashboard to be operated on different screen sizes. To provide an overview, the application should be usable on our display wall, which uses eight curved displays in two rows. Additionally, we implemented several options to select data subsets and to filter the data for each visualization. The application was realized as a web application using Dash [1]. We aimed at visualizing the requests based on their temporal and spatial distribution and therefore started to analyze the temporal and spatial information in the data to determine preprocessing steps. The geographic coordinates of the dataset were given as decimal numbers in the WGS-84 geographic coordinate system. A particularity arises from the fact that, due to the properties of route requests, the current location of a user can be in the query data. Data that contain private information of the user must be handled very carefully to ensure the user’s privacy and data protection. To comply with legal requirements concerning data protection, the public transport association truncated all coordinates to two decimal places before submitting the data to us for further analysis. This causes a possible deviation of the position by about 1 km. For a geographic analysis in the inner-city area, this accuracy is insufficient. In this work, the data were analyzed on a general and not a personal level. For these two reasons, we refined the coordinates to at least five decimal places, using additional location information. A total of 93% of the connection searches included stations as sources and/or destinations.
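The deviation of about 1 km caused by truncating coordinates to two decimal places can be checked with a short calculation. The sketch below assumes a spherical Earth with a mean radius of 6371 km and a latitude of roughly 49° N for the Karlsruhe region; the function names are chosen for illustration only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)

def truncate(value, places=2):
    """Cut off all digits after the given number of decimal places."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor

def max_truncation_error_m(latitude_deg, places=2):
    """Upper bound of the position shift caused by truncation, in metres.

    Returns the bound in north-south and in east-west direction.
    """
    step_deg = 10 ** -places
    metres_per_degree = math.pi / 180 * EARTH_RADIUS_M
    north_south = step_deg * metres_per_degree
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# Assumed latitude of about 49 degrees north
north_south, east_west = max_truncation_error_m(49.0, places=2)
refined_ns, refined_ew = max_truncation_error_m(49.0, places=5)
```

Under these assumptions, two decimal places correspond to a shift of up to roughly 1.1 km in north-south direction and somewhat less in east-west direction, whereas five decimal places bring the bound down to the order of one metre.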
The station IDs were defined by the IFOPT (Identification of Fixed Objects in Public Transport) specification. This circumstance allowed a fast and straightforward mapping with another dataset that contained the stops and their respective more precise coordinates. For the rest of the coordinates, we used the geocoding tool Nominatim [2] from OpenStreetMap. Figure 3 displays the result of this approach, showing the spatial distribution of requests and how their precision could be improved. Based on this preprocessed data, we developed an interactive dashboard for the analysis of passenger information data. Several types of analysis and visualizations can be explored on this dashboard. Various settings parameterize the data analysis. The user can choose different temporal settings, e.g., select the time frame or certain days of the week to analyze. In addition, the applications used for the query can also be selected. Individual stations can be selected and thus be examined more closely. The data are visualized in several graphs.
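The coordinate refinement described above, a lookup by station ID with geocoding as a fallback, could be sketched as follows. The dictionary keys, the stop ID and the `geocode` callable are hypothetical; in particular, the callable stands in for a geocoding service such as Nominatim and is passed in as a parameter rather than calling any real API.

```python
def refine_coordinates(point, stop_coords, geocode=None):
    """Return more precise (lat, lon) for a request endpoint where possible.

    point:       dict with truncated 'lat'/'lon' and optional 'stop_id', 'name'
                 (assumed field names)
    stop_coords: dict mapping station IDs to precise (lat, lon) tuples
    geocode:     optional callable name -> (lat, lon) or None (assumed interface)
    """
    stop_id = point.get("stop_id")
    if stop_id in stop_coords:
        # Most requests reference a station, so this is the common case
        return stop_coords[stop_id]
    if geocode is not None and point.get("name"):
        result = geocode(point["name"])
        if result is not None:
            return result
    # Nothing better available: keep the truncated coordinates
    return (point["lat"], point["lon"])

# Hypothetical data for illustration
stop_coords = {"de:00000:1": (49.00937, 8.40444)}
fake_geocoder = {"Example Street 1": (49.01234, 8.41234)}.get

by_station = refine_coordinates(
    {"stop_id": "de:00000:1", "lat": 49.0, "lon": 8.4}, stop_coords)
by_geocoding = refine_coordinates(
    {"name": "Example Street 1", "lat": 49.01, "lon": 8.41},
    stop_coords, geocode=fake_geocoder)
unchanged = refine_coordinates({"lat": 49.02, "lon": 8.39}, stop_coords)
```

Trying the station lookup first keeps the number of geocoding calls low, which matters for a dataset with millions of requests.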
Two pie charts show the distribution of requests among the individual applications and user agents. A bar chart shows the number of requests that mention a point as either the origin or the destination. This chart helps to identify the most requested stops, for example, to choose stops for further analysis. A line chart shows the spatial distribution of the request for the selected time. Another line chart shows the relative frequency of daily requests per weekday, displayed in Figure 4. The graph is interactive and allows the user to select the displayed weekdays and examine the data points more closely. A heat map and a scatter map illustrate the spatial distribution of the requests. The origin-destination relations of the requests are represented by a Sankey diagram, shown in Figure 7. Sankey diagrams have been used to analyze public transport data before, for example, by W. Zeng et al. [48]. The nodes of the Sankey diagram represent the number of times an origin (left node) or a destination (right node) is given. The length of a node represents the sum of incoming or outgoing connections. The edges connect the origins and destinations with each other and represent the number of connections of each relation by their width. This diagram helps to analyze frequently requested connections and supports a closer analysis of the efficiency of the public transport network.
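The quantities behind such an origin-destination Sankey diagram can be derived with a simple aggregation. The following sketch assumes that each request has already been reduced to an (origin, destination) pair; the stop names are made up, and the drawing itself is left to the plotting library.

```python
from collections import Counter

def od_flows(requests):
    """Aggregate (origin, destination) pairs into Sankey quantities.

    Returns the edge widths per relation and the node lengths for the
    origin side (left) and the destination side (right).
    """
    flows = Counter(requests)
    origin_totals = Counter()
    destination_totals = Counter()
    for (origin, destination), count in flows.items():
        origin_totals[origin] += count            # outgoing connections
        destination_totals[destination] += count  # incoming connections
    return flows, origin_totals, destination_totals

# Hypothetical requests
sample = [
    ("Stop A", "University"),
    ("Stop A", "University"),
    ("Stop B", "University"),
    ("Stop A", "Main Station"),
]
flows, origin_totals, destination_totals = od_flows(sample)
top_relation = flows.most_common(1)[0]
```

Sorting the relations by count, as in `most_common`, is what makes unusually dominant connections stand out.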
Results and Future Work: The result of this work is a web application dashboard that can be displayed and operated on a display wall using several screens, but can also be used on regular screen sizes. Figure 5 shows the application on our display wall consisting of eight separate displays. It allows the visualization of large data volumes, utilizing the high resolution of eight separate displays. The dashboard is configurable, so that a user can choose which graphs should be displayed, as shown in Figure 6. A properties tile in the dashboard can be used to choose the subset of data that should be displayed and to filter the data. Regarding our goal of assessing whether conclusions about the data and its further analysis can be drawn from the visualizations, we found that usage patterns could be revealed in the data.
These usage patterns indicate that the request data reflect the actual demand for public transport. Morning and evening peaks are, for example, clearly visible in the frequency of requests, as can be seen in Figure 4.
In addition, the peaks at weekends are significantly flatter than on weekdays and are shifted back in time, as they are in analyses of passenger numbers. The usage of trip requests is significantly lower during school and semester breaks due to the absence of school traffic. Such known usage patterns are apparently represented in the request data, which links the requests to actual demand. However, the visual analysis also revealed that a considerable number of the queries are automated queries. These queries, in turn, do not represent a real demand. This is illustrated in Figure 7, for instance. The figure shows a section of the origin–destination Sankey diagram. The top connection was requested many times and more frequently than the others. This most frequent connection leads to the stop of the local university. The second connection in turn leads to the city’s main railway station. Although there is certainly a significant demand for trips to the university, the discrepancy between the ratios is clear. Such discrepancies can be explained by automatic requests from bots or other applications that make specific requests in frequent time intervals. Some of these requests can be excluded from analysis by excluding all requests to the endpoint those applications use. However, there are other automated requests in the data, made by applications and widgets, for example, that are not as frequent and not as easy to identify. Such automated requests distort analyses that focus on public transport demand. To use the data for further analysis or operational decisions, these bot requests must therefore first be filtered. We are currently pursuing several approaches to filter such automated requests.
Our visualization dashboard was tested in exploratory tests by students of transport management and we received positive feedback for our approach towards visual analytics of public transport data. In discussion with representatives from public transport agencies, visualizations proved to be crucial to convey data analysis results to public transport experts, which is why we continue to develop dashboards for visualizations of our data analyses. Future work using this dashboard application includes a user test to measure the usability of this application and a test with representatives from public transport agencies to assess its utility. Based on these tests, we aim at improving the application and iteratively integrating additional data sources.

4.2. Project 2: Analyzing Demand for On-Demand Planning

Use Case: As we saw in our visualization project, data from automated passenger information systems can be used to analyze transportation demand. In a project building on this realization, we wanted to pursue this approach. The objective of this project was therefore to help transportation companies to explore places and times where travel options are not yet sufficient to meet all travel needs. The first goal in this project was to analyze public transport coverage in general, based on API data. Based on such analyses, public transport companies discuss the extension and development of their network. However, apart from adding new railway lines or lines of trams or buses, public transport agencies also consider other types of services to provide travel options to their passengers in places or times that are currently underserved. A second goal of this project therefore was to discover and optimize regions for the implementation of on-demand services based on travel intention and travel behavior. Analyses of travel behavior and demand, for example, using smart card data and exploring questions similar to ours, can be found in works by M. Bagchi and P.R. White and by L.M. Kieu et al. [36,49].
Data: We used similar data as in the first project. In this case, however, the data were provided by the MVV (Münchner Verkehrs- und Tarifverbund, the transport association for the region of Munich).
Methods: User queries were clustered geographically to find focal points. As described before, the coordinates of a query are blurred to two decimal places to ensure the privacy of users. This already leads to a cluster grid, and each start or end point of a trip request is assigned to a cluster-point in this grid. One problem with the data is that, while route requests are stored, the suggestions a user receives in response to that request are not. To obtain these data, the requests were rerun for the time they were originally run. This has the disadvantage that the historical state of traffic is no longer accurately observed, since requests for a time in the past are processed on timetable data only, not considering real-time data. Then, it was examined how well a trip proposal matched the requested arrival or departure time, if there were any trip proposals at all. This allows an assessment of whether a trip request is well or poorly served by the existing travel services.
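As a minimal sketch of the cluster grid described above, and assuming that the blurring corresponds to rounding latitude and longitude to two decimal places (the exact blurring procedure is not specified here), request points can be assigned to grid cells as follows. The sample coordinates and the helper name `blur` are hypothetical.

```python
from collections import Counter


def blur(lat, lon, decimals=2):
    """Map a coordinate to its grid cell by rounding (assumed blurring rule)."""
    return (round(lat, decimals), round(lon, decimals))


# Hypothetical start points of trip requests (latitude, longitude).
points = [
    (48.1374, 11.5755),
    (48.1391, 11.5802),
    (48.1412, 11.5761),
    (48.2631, 11.6688),
]

# Each grid cell acts as a cluster point; the count is the number of requests in it.
cells = Counter(blur(lat, lon) for lat, lon in points)
for cell, count in cells.most_common():
    print(cell, count)
```

At this latitude, a cell of 0.01 degrees spans roughly one kilometre north to south, which illustrates why small towns may be represented by only a handful of cluster points.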
Focusing on two study regions, we analyzed the data to support on-demand transport planning in these regions. Heatmaps, time wheels and network spiders were used for visualization. Heatmaps are used for a spatial breakdown of the request priorities, as displayed in Figure 8. They are computed both for an entire period of time and for several points in time to display an animated temporal progression. The heatmaps show points where the requested routes are unsatisfactory. A route is unsatisfactory if the ratio of travel time to actual trip time, meaning time spent in a vehicle, is greater than a given threshold, 1.1 in the example. This indicates poor coverage, because passengers spend a lot of time waiting for their bus or train. Time wheels, on the other hand, focus more on a temporal classification. In this case, in a view broken down by hours, request priorities can be quickly identified, as shown in Figure 9. Time wheels were created for an entire region as well as for individual spatial clusters. While these two visualizations only show start and destination points of a route request, these are connected with the network spider. A network spider, as shown in Figure 10, connects a starting point with various end points. In terms of our data, this means that all destinations (end points) requested from a single location (starting point) are displayed and connected. As an alternative, all request locations to a specific destination could also be displayed. This allows all outgoing and incoming connections to a selected cluster to be displayed. This is intended to make it easier to assess whether the desired connections are only local or more cross-regional.
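The criterion for unsatisfactory routes can be sketched as a simple ratio test. This is an illustration under assumptions: we read "travel time" as the total door-to-door duration of a proposal, treat a missing proposal as unsatisfactory, and use the hypothetical names `TripProposal` and `is_unsatisfactory`; the threshold of 1.1 is the example value from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TripProposal:
    total_minutes: float       # assumed: full travel time including waiting and transfers
    in_vehicle_minutes: float  # time actually spent in a vehicle


def is_unsatisfactory(proposal: Optional[TripProposal], threshold: float = 1.1) -> bool:
    """Flag a request whose best proposal spends too much time outside vehicles."""
    if proposal is None or proposal.in_vehicle_minutes <= 0:
        return True  # assumption: a request without any usable proposal is poorly served
    return proposal.total_minutes / proposal.in_vehicle_minutes > threshold


print(is_unsatisfactory(TripProposal(30, 28)))  # ratio of about 1.07
print(is_unsatisfactory(TripProposal(45, 30)))  # ratio of 1.5
```

Counting the flagged requests per grid cell would then yield the values that a heatmap of unsatisfactory routes displays.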
Results and Future Work: During the evaluation, we identified a need for improvement in places without suburban rail connections. There are requests there that cannot be served at all or only very poorly by the existing public transport system. This means that potential passengers are more likely to choose an individual mode of transportation, provided they have an opportunity to do so. This also means that this individual mode of transport is often used for both the outbound and the return journey, even though there might have been a public transport service on one of the trips. In addition, places that are further away from the core region have poorer connections, which is clearly reflected in the analysis, and is a well-known fact. The analysis therefore needs to provide greater detail and more context to be of use for public transport agencies analyzing more rural regions. In the case of small towns, the problem was that they were not well represented by the usage of the blurred coordinates. At first, we utilized the blurred coordinates for clustering. However, since on-demand transport often involves door-to-door connections, it is necessary to use the most accurate data possible in future work. This is why we will investigate privacy preserving methods that still deliver precise results in future work. Future work will also utilize not only the routing requests but also the recommendations given by the automated passenger information system as historical data to allow for a comparison with current timetables.

4.3. Project 3: Predicting Passenger Numbers in Vehicles

Use Case: The previous project used data from automated passenger information systems for the analysis of travel demand in retrospect, planning for future services. We were also interested, however, in examining the short-term analysis of these data and wanted to see if we could also use route requests to reach conclusions about real-time travel demand. Using our insights from prior research, we wanted to explore whether data from the automated passenger information system could be used to predict passenger numbers in a vehicle at a given time. We also wanted to know if such a prediction could be achieved in real time, based on current request patterns, to then be included in passenger information. The prediction of vehicle occupancy has been explored in other works based on different types of data, for example, by Gilles Vandewiele et al. and by J. van Roosmalen [31,32].
Data: To enable our research, we needed datasets of observed passenger numbers and the generated route responses for the same period. For this project, we worked with three different transport operators in Germany. Two of the three companies were able to provide us with full passenger count data from their respective APC systems. The third company was not able to obtain permission to release the absolute number of passengers. Only aggregated figures could be provided. Both technical difficulties and data privacy concerns were expressed as the reason. Those aggregated figures were unsuitable for our methodology and were not used further. We therefore could use data from APC systems of two public transport agencies, together with route requests and the recorded route responses of the respective API systems. The APC data were recorded during the same time as the API data.
For every leg in the route response dataset, the following information was given: a proprietary public line identifier, the stop ID and coordinates of the origin and destination stops as well as the planned departure time at the origin stop and the planned arrival time at the destination stop. The APC data contained a line identifier, a stop identifier, a coordinate, a departure time and the recorded number of passengers getting on and off at that stop.
Methods: To investigate whether using route requests and responses can improve the prediction of ridership, we developed multiple machine learning models. Each model was trained and evaluated using two datasets. The first dataset contained all available information. The second dataset did not contain any information about the route responses. In this way, we were able to evaluate the impact of the API information on the accuracy of ridership prediction. The target variable of our prediction was the change in ridership at each station, meaning the difference between boarding and alighting passengers at each station. Using this value, we then calculated the total number of passengers after each station over the course of the journey. We chose Random Forest (RF) and Gradient Boosted Trees (GBT) for our approach. Our literature review showed that these algorithms can perform well in predicting ridership. In related work using datasets similar to ours, these algorithms performed the best [32,40,50]. In addition, with tree-based algorithms, it is possible to obtain an insight into which features are important for the forecast. We hoped this would give us further insights into how important the API data are for the prediction. Before we could apply these algorithms, however, we had to overcome some challenges in matching the passenger count data with the route responses, as the two datasets contained different IDs that had to be matched to each other. Specifically, we had to match each leg proposed to the users with the trip that was operated by the transportation company. None of the datasets used standardized or uniform IDs to designate these trips or legs. In addition, the stop identifiers in the two datasets did not match. The standardized IFOPT stop IDs were used for the route response data, but not for the APC data. Instead, proprietary IDs were used there. Furthermore, the public line identifier also differed between the two datasets.
Therefore, we had to take the following elaborate approach to be able to link the two datasets. First, we used the coordinates of the stops that were given in both data sets. Using those coordinates, we calculated the nearest matching stops between both datasets. This allowed us to map the stops. Next, the assignment of legs and trips was made using the following criteria:
  • The line identifier of the trip in the APC data must be the same identifier as the line identifier of the leg in the trip route response.
  • One stop of the APC trip must correspond to the origin stop of the leg in the route response.
  • The departure time at this stop must be the same or similar in the APC data as in the route response.
  • The possible APC trip must also serve the destination after the origin stop of the leg.
Using this method, we successfully merged the two datasets and were able to apply the algorithms to the data. However, the matching process proved to be quite complex and therefore time consuming. We could see at this stage that, given the current data format, a real-time prediction of vehicle occupancy would not be feasible. However, we continued with our work to explore prediction algorithms. In the course of feature engineering, we generated additional features using various aggregation methods. For example, we computed the average number of passengers boarding and alighting for different departure time windows, and how many stops the vehicle had already made before arriving at the current stop. In addition to the data provided by the transport companies, we also integrated weather data, such as mean temperature, wind speed or measured precipitation, for the respective departure day into our models. Weather has been shown to influence public transport usage [51].
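The stop mapping and the four matching criteria listed above can be sketched as follows. This is a simplified illustration, not the procedure actually used: the record layouts, the IDs, the two-minute tolerance for "same or similar" departure times and the assumption that line identifiers have already been harmonized are all hypothetical.

```python
import math
from datetime import datetime, timedelta


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def nearest_stop(coord, stops):
    """Return the ID of the stop in `stops` (id -> (lat, lon)) closest to `coord`."""
    return min(stops, key=lambda stop_id: haversine_m(coord, stops[stop_id]))


def leg_matches_trip(leg, trip, stop_map, tolerance=timedelta(minutes=2)):
    """Check the four criteria for one route-response leg and one APC trip."""
    if leg["line"] != trip["line"]:                       # criterion 1: same line
        return False
    origin = stop_map[leg["origin"]]
    destination = stop_map[leg["destination"]]
    stops = [stop for stop, _ in trip["calls"]]
    for i, (stop, departure) in enumerate(trip["calls"]):
        if (stop == origin                                        # criterion 2: origin served
                and abs(departure - leg["departure"]) <= tolerance  # criterion 3: similar time
                and destination in stops[i + 1:]):                # criterion 4: destination later
            return True
    return False


# Hypothetical stops: IFOPT-style IDs in the route responses, proprietary IDs in the APC data.
api_stops = {"de:08212:1": (49.0094, 8.4044), "de:08212:2": (49.0130, 8.4040)}
apc_stops = {"4711": (49.00941, 8.40442), "4712": (49.01302, 8.40398)}
stop_map = {ifopt: nearest_stop(coord, apc_stops) for ifopt, coord in api_stops.items()}

leg = {"line": "2", "origin": "de:08212:1", "destination": "de:08212:2",
       "departure": datetime(2021, 2, 1, 8, 0)}
trip = {"line": "2", "calls": [("4711", datetime(2021, 2, 1, 8, 1)),
                               ("4712", datetime(2021, 2, 1, 8, 4))]}
print(stop_map, leg_matches_trip(leg, trip, stop_map))
```

Comparing every leg with every trip in this way scales with the product of both dataset sizes, which is consistent with the observation that the matching was time consuming.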
Results and Future Work: Using tree-based algorithms allowed us to analyze which features of the data were used for the prediction by inspecting feature importances. Interestingly, the relative importance of features for the prediction was very similar for the data of the two different public transport companies. This indicates that a model could be reusable for a different public transport company and a different public transport network, without requiring a completely new training phase.
To measure the accuracy of our prediction, we used two criteria. The first criterion was the root-mean-squared error (RMSE) between our prediction of the ridership change and the true observed value. The RMSE was chosen to penalize and prevent high deviations in the forecast more strongly. This criterion was also used to tune the hyperparameters of the machine learning models. For the tuning, we used a five-fold randomized cross validation. The hyperparameters for both the Random Forest model and the Gradient Boosted Trees model were set using a randomized parameter optimization [52]. We then used several hyperparameter sets. For the n_estimators parameter, we identified a range between 50 and 250 and used a random value in this range for model training. For max_features, we tried both all features (max_features = n_features) and max_features = sqrt(n_features). For min_samples_split, we used the values of 2, 5 and 10, while for min_samples_leaf, we used 2, 5, 10, 15 and 100 in trainings. For the second accuracy criterion, we used the ridership calculated from the predicted ridership change. This value was compared to the observed ridership at a threshold of 10 passengers. This makes it possible to map a percentage accuracy that, in contrast to the RMSE, allows a better comparison with other studies. The results between the datasets of the two transport companies were similar. For reasons of clarity, we only present the results of one of the companies in this paper, as shown in Table 4.
As can be seen, the results of the RF model were better than those of GBT. The inclusion of API data improves the prediction of total ridership by almost 15%.
Thus, the API data seem to be of great use for forecasting ridership. Nevertheless, we find it difficult to make a definitive assessment of our results. The background is that the study period in this project was during a peak phase of the COVID-19 pandemic. As a result, the observed passenger numbers were significantly lower than in normal operation. The president of the association of German transport companies estimates that the number of public transport passengers in Germany in February 2021 was between 60% and 70% lower than during normal operation [53]. We assume that this circumstance significantly influences our results. We hope to repeat the study soon under normal conditions. Future research could also investigate the impact of including API data when other complex machine learning models, such as deep neural networks, are used to predict passenger data.
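The two accuracy criteria used in this project can be sketched as follows. This is an illustration under assumptions: we interpret the threshold of 10 passengers as counting a prediction as correct when the calculated ridership deviates from the observed ridership by at most 10 passengers, and the function names and sample values are hypothetical.

```python
import math
from itertools import accumulate


def rmse(predicted, observed):
    """Root-mean-squared error between predicted and observed ridership changes."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def load_after_each_stop(changes, initial=0):
    """Passengers on board after each stop, from boardings minus alightings per stop."""
    return list(accumulate(changes, initial=initial))[1:]


def threshold_accuracy(predicted_load, observed_load, threshold=10):
    """Share of stops where the predicted load is within `threshold` passengers."""
    hits = sum(abs(p - o) <= threshold for p, o in zip(predicted_load, observed_load))
    return hits / len(observed_load)


# Hypothetical ridership changes along one trip.
observed_change = [12, 5, -4, -13]
predicted_change = [10, 9, -1, -6]

error = rmse(predicted_change, observed_change)
accuracy = threshold_accuracy(load_after_each_stop(predicted_change),
                              load_after_each_stop(observed_change))
print(f"RMSE = {error:.2f}, accuracy = {accuracy:.0%}")
```

Because the load is a cumulative sum, errors in the predicted changes can add up along a trip, which is one reason to evaluate the calculated ridership separately from the per-stop change.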

4.4. Project 4: Analyzing Usage of Bike and E-Scooter Sharing

Use Case: An issue of public transport often is the problem of the “last mile”, meaning that passengers need to cover the last trip leg from a stop to their final destination in some way and long distances between stops and final destinations can make people hesitant to use public transport. With the emergence of bike and e-scooter sharing services, these services have often been proposed as a good complement to public transport, because they can enable passengers to cover their last mile comfortably. However, the usage patterns of bike and e-scooter sharing services have not been investigated in relation to public transport yet. In a related work, Albuquerque et al. analyzed bike-sharing data from Lisbon to identify mobility patterns for the optimization of bike-sharing services [21]. Big data analysis has also been used for fleet management of shared mobility services [54].
The idea of this project is, therefore, to use historical data from e-scooter and bicycle sharing providers to make predictions about vehicle movements in the near future. On the one hand, knowing patterns in vehicle movements and distribution could result in operational advantages and, on the other hand, it can enable users to plan with greater reliability, since booking in advance is often only possible to a very limited extent. As discussed above, sharing vehicles are a good complement to public transport when it comes to the last mile of a passenger’s trip. However, a certain reliability of the service is needed. With conventional public transport, this is ensured by the timetable. In the case of free-floating sharing vehicles, reliability has to date been based at most on experience. The prediction developed in this project should reflect such experiences.
Data: We collected data from March and April 2021 from two sharing providers in the city of Karlsruhe. The data were retrieved using application programming interfaces from the operators and were requested at minute intervals. Data were stored with an entry for each trip, comprising origin and destination as well as the respective times. However, data from various providers differ in their details. For example, there are differences in the accuracy of the location data of the vehicles. Additionally, available application programming interfaces are difficult to use. There is one provider that, for example, depending on the size of the queried area, only provides geographically summarized data and does not provide the exact vehicle positions. The APIs of the sharing providers Nextbike (bicycles) and Tier (e-scooters), which were used in the project, provide unique identifiers and precise coordinates and were chosen because of the details in the data. In addition, the sharing systems differ fundamentally, being either station-based or free-floating networks or a combination of these. Nextbike offers a free-floating network that is extended by additional stations and through which the user can reserve a vehicle only 30 min in advance. It is even simpler with the Tier provider. It has a free-floating network without stations or advance booking. This heterogeneity, however, impacts the portability of our approach.
Methods: Two main methods from the field of machine learning were used to answer the questions. One of them was a historic cluster with a k-nearest-neighbor algorithm and the other a convolutional variational autoencoder (CVAE) in combination with a long short-term memory model (LSTM) to predict the network state of sharing providers. The approach of the historical cluster starts a clustering in historical data based on a query with request time and planned trip time. The 3 most similar network states are searched with a KNN (k-nearest neighbor) algorithm, and their development is analyzed. We tried several configurations for the number of nearest neighbors, k, and arrived at k = 3 as a suitable parameter. As a result, a probability value is obtained as to whether vehicles that are available at the time of the request will also be available at the planned time of the trip. The second approach does not consider individual queries, but attempts to predict the development of the entire network. A CVAE is trained in 100 epochs with the help of 6000 historical network states. The autoencoder tries to reduce the network state to two values (dimensions) at a time. In different iterations of learning, the model is trained to first reduce a network state to two values and then to regenerate the network state so that it matches the original network state as precisely as possible. In the next step, an LSTM with 3 layers is trained in 120 epochs on the two-value pairs of the different time steps. Based on this time series, the LSTM can then predict new values for the future. With these future values, a future network state can be generated using the CVAE.
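The idea of the historical cluster approach can be illustrated with a small sketch. It rests on several assumptions that are not taken from the project itself: a network state is encoded as a list of vehicle counts per grid cell, similarity is measured with the Euclidean distance, and the probability is the share of the k nearest historical states in which the cell still holds a vehicle after the planning horizon. The function name `knn_availability` and the sample data are hypothetical.

```python
import math


def knn_availability(query_state, history, horizon, cell, k=3):
    """Estimate the probability that `cell` still holds a vehicle `horizon` steps ahead.

    history: network states ordered in time; each state is a list of vehicle
             counts per grid cell (assumed encoding).
    """
    # Only states with an observed successor `horizon` steps later can be used.
    candidates = range(len(history) - horizon)
    nearest = sorted(candidates, key=lambda i: math.dist(query_state, history[i]))[:k]
    if not nearest:
        raise ValueError("history is too short for the requested horizon")
    # Look at how each similar historical state developed.
    still_available = sum(history[i + horizon][cell] > 0 for i in nearest)
    return still_available / len(nearest)


# Hypothetical history of a network with three grid cells.
history = [
    [2, 0, 1],
    [1, 1, 1],
    [0, 2, 1],
    [2, 0, 1],
    [2, 1, 0],
    [1, 1, 1],
    [0, 1, 2],
]
probability = knn_availability([2, 0, 1], history, horizon=1, cell=0, k=3)
print(f"{probability:.2f}")
```

With k = 3, the estimate can only take the values 0, 1/3, 2/3 or 1, so it is best read as a coarse indication rather than a calibrated probability.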
Results and Future Work: Our results show that such a prediction with machine learning methods is possible, but the reliability of the results currently has a broad range, e.g., depending on the prediction length. As shown in the diagram in Figure 11, the first methodology using the KNN algorithm has an accuracy of about 80% for up to 6 h of prediction. After that, this value quickly drops to about 50% accuracy, which means that good predictions can no longer be made. The value of 80% is also not suitable for reliable planning, but it gives a good indication for further analysis, for example. These values were created with 100 test requests at sample locations in the historical data.
With the second methodology, which should predict the whole network, we had mixed results. The part in which network states are compressed and regenerated with the help of the CVAE works very well after some adjustments. The ELBO (evidence lower bound) value of the CVAE training is an indication of how well the model is trained. While an increase from 10 to 100 epochs in testing improves the model greatly, further increases up to 1000 epochs show little improvement. However, the prediction of the compressed values using the LSTM could not yet be sufficiently adapted, so that the results after the prediction and re-generation of a network state appear very blurred and deviate too much from the real development. In this case, an even broader database would contribute to more accurate results, which is part of our future work. In addition, further research should be conducted on the configurations of the prediction models. Prediction models other than the LSTM or particular subsets of it could be tested. In addition, the dimensions of the CVAE could be increased to increase the number of predicted values and thus perhaps achieve better results in the prediction model.

4.5. Project 5: Detecting Anomalies in Vehicle Data

Use Case: In another research project, we analyzed the vehicle data that are recorded by the on-board computers of trams and buses. In contrast to data from passenger information systems and usage data of sharing vehicles, these data are not passenger or usage related. The objective of this project was to use vehicle data for the analysis of public transport operations. The aim was to detect anomalies in the data and consequently in the service performed. For instance, if a vehicle had to take a different route than usual due to a disruption in the traffic network, the data should show this deviation. A retrospective analysis of the data can uncover the frequency of such disruptions, for example, and give insights into underlying problems. Additionally, undiscovered errors of a system component can lead to anomalies in the data. An analysis of the data can uncover unknown problems. Public transport companies are interested in detecting these anomalies to identify faulty components and to analyze vehicle and network performance. Additionally, deviations from the vehicle route are often not recorded on other systems and therefore cannot be retraced in retrospect during network evaluations or in network planning. The analysis of GPS-based data for anomalies has been investigated for air traffic by Luis Basora et al. and for individual traffic by Li Cai et al., for example [55,56]. For the detection of anomalies in railway infrastructure systems, da Silva Ferreira et al. presented an analysis of unsupervised machine learning methods [57].
Data: Each on-board computer logs events from the trams’ various system components. The reception of a new GPS coordinate, the opening of the vehicle doors, the selection of a new destination or a new line by the driver, or the activation of the voice radio are entered in the log, for example. The on-board computer that produced the data we analyzed generates one log file per day and per vehicle. The data available for our project were recorded from 4 January 2021 to 20 April 2021. The 472 vehicles generated 139,859 files during this period. Each file has an average size of 5.6 MB. Thus, over 780 GB of data were recorded during our study period. This corresponds to more than 9 billion logged events. This again displays the mass of data that is generated during public transport operation and guided our selection of methods.
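The stated data volume can be checked with a short calculation. The figures are taken from the paragraph above; using decimal gigabytes (1 GB = 1000 MB) and treating 9 billion events as a lower bound are assumptions of this sketch.

```python
files = 139_859
avg_file_mb = 5.6
events = 9_000_000_000  # "more than 9 billion", used here as a lower bound

total_gb = files * avg_file_mb / 1000  # decimal gigabytes
events_per_file = events / files       # average number of logged events per file

print(f"{total_gb:.1f} GB in total, about {events_per_file:,.0f} events per file")
```

The result of roughly 783 GB agrees with the reported figure of over 780 GB.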
Methods: To be able to efficiently manipulate and analyze the data, we used a high-performance computer (HPC). The first errors and anomalies were identified when the data were imported. These were mainly due to faulty software components. To make the data usable, a complex preprocessing had to be carried out, since necessary information, such as coordinates or line designations, had to be extracted first. To detect operational anomalies, we focused on routes. These are represented in the data by the recorded geo-coordinates of the vehicles. Since we had no labeled data available, we chose an unsupervised machine learning approach as a first step for anomaly detection.
We chose the cluster algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to start our analysis and cluster regular and anomalous trips. We chose the DBSCAN algorithm because it has been used successfully for anomaly detection before [58]. Additionally, it is not necessary to specify the number of clusters as a parameter. This is a crucial advantage because the number of possible anomalies is unknown. For the calculation of the distances between the trajectories, the Hausdorff distance was chosen based on the literature review by Philippe Besse et al. [59]. Since this is an unsupervised machine learning problem where there is no ground truth, the tuning of hyperparameters had to be performed by the visual analysis of retrospective results. This was a time-consuming effort and one of the biggest challenges in the entire project.
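The combination of the Hausdorff distance with DBSCAN can be illustrated with a compact, dependency-free sketch. It is not the implementation used in the project: the trajectories are hypothetical planar points, the Euclidean point distance is an assumption, and the small `dbscan` function is a textbook version operating on a precomputed distance matrix.

```python
import math


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two trajectories (lists of (x, y) points)."""
    def directed(p, q):
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))


def dbscan(dist, eps, min_samples):
    """DBSCAN on a precomputed distance matrix; returns labels, -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    # A point counts as its own neighbour, since dist[i][i] == 0.
    neighbours = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1               # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # expand the cluster from another core point
    return labels


# Hypothetical trips: three follow the regular route, one takes a detour.
trips = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0.01), (1, 0.01), (2, 0.01)],
    [(0, 0.02), (1, 0.0), (2, 0.02)],
    [(0, 0), (1, 1), (2, 0)],
]
dist = [[hausdorff(a, b) for b in trips] for a in trips]
labels = dbscan(dist, eps=0.09, min_samples=2)
print(labels)
```

The full distance matrix requires a comparison of every trip with every other trip, so the cost grows quadratically with the number of trips. For larger datasets, library implementations such as scikit-learn's `DBSCAN` with `metric="precomputed"` and SciPy's `directed_hausdorff` may be preferable to this sketch.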
Results and Future Work: We applied the procedure to different subsets of the data, separating data from lines and using data from short time periods. Figure 12 and Figure 13 show the results of clustering 1000 trips of a single line, for example. For this clustering result, the parameters were set to the following values: Epsilon = 0.09 and min_samples = 50. Cluster 0 and cluster 1, depicted in Figure 12, contain regular trips, each in one direction of travel. In the resulting clusters, trips starting from the most eastern stop (in the graph on the right) are distinguished and clustered separately from trips ending on this stop. These trips differ because the vehicle takes a slightly different route on this stop when it is starting from there versus when it is ending the trip there. With a different hyperparameter set, these trips are all clustered in one single cluster. However, in most hyperparameter settings that clustered all regular trips into one cluster, some anomalies were clustered in this cluster as well. We decided to use hyperparameters that distinguish the two types of regular trips and identify anomalies reliably. Figure 13 shows the cluster containing anomalous trips. In this set, 133 anomalies were identified. Several types of anomalies can be identified using the color coding of trips in Figure 13, meaning that many anomalies occur repeatedly. These can then be reviewed by domain experts to identify the actual routes of the anomalies and the reasons for the deviations.
For this subset, the clustering procedure works well. Regular trips are assigned to the respective clusters. Anomalies are sorted out and can be considered in further process steps. Limitations arise in the portability of the method between different transport systems and in the use of the entire dataset in contrast to using only subsets. Both traditional trams and tram-trains operate in the public transport network from which we received our data. While trams mainly operate in the inner-city area, tram-train lines also serve regional areas. This results in several differences between the two transport systems, for example, in line length, cycle times and the number of line variations. This affects the determination of the hyperparameters and the result of the clustering. A tram-train line requires a much higher distance threshold (Epsilon) to achieve accurate results than an inner-city tram line. These results suggest that a division of the data into subsets for each transport system is a reasonable approach. However, we still intend to increase the time periods of the data we use and to use data from several tram lines or, respectively, several tram-train lines together.
We are currently investigating the application of other cluster algorithms, such as HDBSCAN (Hierarchical-DBSCAN), which allows a flexible choice of the distance threshold. Another limitation of our method is the computationally complex calculation of the Hausdorff distance matrix. In the current method, the distance of all trips to each other must be determined. We are currently exploring if we can speed up our calculation by using GPUs rather than CPUs.
We further want to explore the reasons and effects of the identified anomalies. For this, we are currently developing an interactive map that will allow us to study similar journeys and the transport network as a whole, picking up the insights of our visual analytics project.

5. Challenges, Solutions and Lessons Learned

In our data science projects in the field of public transport presented in this paper, we noticed some difficulties handling public transport data and accessing the potential of the data. Some of these difficulties are certainly not unique to the field of public transport and can be encountered in general in the application of data science to big data. Other difficulties, however, occur repeatedly when working with public transport data and point to opportunities to make a difference for the future of data analysis in public transport by addressing them.
In this section, we discuss these difficulties, what their reasons might be and how they can be mitigated.
For scientists, it is often difficult to obtain and work with suitable datasets from the public transport sector because few datasets are available. From the point of view of transport companies, the data are often highly sensitive. In most cases, transport companies are in close competition with each other. In this context, data such as passenger count data can give competitors an unfair advantage. This in turn means that transport companies are often reluctant to make such data available to researchers, or even to the public by implementing an open data policy. Part of the solution may be laws and policies that encourage transport operators to share their data and adhere to open data policies. For public institutions, such guidelines already exist, for example, the E.U. Open Data Directive, or laws, such as the Open-Data-Law in Germany. It is conceivable to extend these guidelines and laws to transport companies, which are often already in public hands or subsidized by public funds. This was implemented, for example, in Germany in 2021 with the adoption of the second Open Data Act. Some public transport agencies have already implemented Open Data policies on their own in recent years, although the extent of the data they provide under these policies varies considerably.
In turn, data that would make it possible to create movement profiles of public transport passengers or to identify them, for example, are highly sensitive and should be and remain protected. At the same time, some anonymization measures can obstruct the application of data science methods and prevent meaningful analysis, as we observed in our own work. In order to address this challenge, it is essential to investigate and apply anonymization and privacy preserving measures that are compatible with the chosen data science methods, such as those reported by Kallista Bonawitz et al., for example [60]. There are also methods specifically for spatiotemporal trajectory data that address the need to analyze location data while preserving the privacy of the users, as proposed by Sina Shaham et al., for example [61]. As the adaptation of data science methods for public transport data is currently in its early stages, public transport agencies as data custodians are still beginning to understand the management of their data. Applying privacy preserving methods requires a deep understanding not only of promising approaches towards the data, but also of the management of the data itself, which is currently still developing in the public transport domain. Therefore, it is an important goal to deepen this understanding and to work towards privacy preserving data management as a collaborative goal of public transport agencies and researchers.
Often, access to relevant data to pursue a specific use case is obstructed due to organizational obstacles. Data sources are managed by different departments of the public transport association and, as described above, a unified data management approach that aims at utilizing these data is, at best, in its very early stages. An essential first step for initiating organizational change towards such a unified data management is the realization of the potential lying in the data. Public transport agencies are just now realizing how data-driven optimization could benefit the modernization and advancement of public transport. Exploring, clarifying, and explaining this potential has been one of our core goals with the pursuit of the projects described in this paper.
Another hurdle for data analysis in public transport is the variety of different identifiers that are used in public transport data. Part of this problem is that the process of defining consistent identifiers is time consuming and labor intensive and it requires cooperation between different public transport providers and software companies. In the field of public transport, this is even more true due to the large number of companies, system components, and the associated large number of stakeholders. Moreover, transport companies often operate beyond the borders of cities, districts, states, and countries. This further complicates the design of standards. Another part of the problem is that, while there are standards for some of these identifiers, they are often not consistently implemented. The planning and operation of the public transport network requires a variety of systems and components that were developed and deployed for specific tasks, but have evolved to support additional tasks and to provide new interfaces, extending their application domain. Public transport agencies often operate legacy systems using outdated data formats. Therefore, some of the system components in public transport use the developed standards for information exchange between systems, but others do not. It is often hard to upgrade the different systems to use the standards. Possible reasons for this are that it would either interrupt operations, be costly, or the software manufacturer has not yet implemented the standards. Stronger subsidies in public passenger transport focusing on unlocking the potential of public transport data to develop a modern sustainable public transport could certainly make it easier for transport and partner companies to upgrade the system to the current standards.
Mappings between these diverse identifiers are specific and tailored solutions, since every public transport operator has a different system setup and the variety of implemented data formats therefore is very high. This obstructs the development of general solutions and mappings and results in relatively expensive, not easily portable preprocessing for the application of machine learning or big data methods. Additionally, such mappings are often time and resource consuming and impede the development of real-time-enabled solutions. A unification of identifiers and usage of standards could improve the usability of public transport data and advance the field towards real-time capable solutions. Meanwhile, mapping tables can be clumsy, but effective short-term solutions for some of these problems.
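As an illustration of such a short-term solution, a mapping table can be sketched as a lookup from (source system, local identifier) to a shared identifier. All system names, identifiers and field names below are hypothetical and do not correspond to any real operator's data; the sketch only shows the general shape of the preprocessing step.

```python
# Hypothetical mapping table: each source system names the same stop differently.
STOP_ID_MAP = {
    ("timetable", "T-0417"): "stop:0001",
    ("passenger_info", "PI/417/a"): "stop:0001",
    ("passenger_count", "417"): "stop:0001",
    ("timetable", "T-0033"): "stop:0002",
    ("passenger_info", "PI/33/a"): "stop:0002",
}


def to_canonical(system, local_id):
    """Return the shared stop id, or None if the mapping table has no entry."""
    return STOP_ID_MAP.get((system, local_id))


def harmonize(records, system):
    """Replace local stop ids with shared ids and collect unmapped ids."""
    mapped, unmapped = [], set()
    for record in records:
        canonical = to_canonical(system, record["stop"])
        if canonical is None:
            # Keep track of gaps so the table can be extended by hand.
            unmapped.add(record["stop"])
        else:
            mapped.append({**record, "stop": canonical})
    return mapped, unmapped


counts = [{"stop": "417", "boardings": 12}, {"stop": "999", "boardings": 3}]
mapped, unmapped = harmonize(counts, "passenger_count")
print(mapped)
print("ids missing from the mapping table:", unmapped)
```

Reporting unmapped identifiers separately, rather than dropping them silently, makes the manual maintenance effort visible, which is the main drawback of this kind of table.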
Many effects that manifest themselves during the data analysis need to be interpreted by domain experts. One example is the identification of bot requests in our analysis of route requests. The experts that manage and maintain the IT infrastructure of a public transport provider have a deep knowledge of their data and systems and their support is needed to develop preprocessing routines for the data efficiently. In the same way, domain experts that know the public transport network and operations are essential in interpreting analysis results in every stage of the analysis, supporting the decision of which approaches to pursue further, but also utilizing the final analysis results for optimization. In addition to a thorough requirement analysis to specify the requirements for a certain use case, these experts should be integrated in the development process. Developing suitable visualizations is vital for this integration to succeed. Especially for public transport data analysis, we observed that visualizations are essential, but also need to be developed carefully, to be understandable by domain experts. We found that interactive visualizations are especially helpful to access multi-dimensional data. We therefore will continue to research suitable visualizations and interactions for public transport data analysis.

6. Summary and Outlook

The progressing digitization of public transport and the vast amount of data generated in public transport allow rich data analysis and the application of a variety of methods, ranging from visualization to machine learning, to advance the understanding of public transport and to develop a foundation for data-driven optimization. For public transport to fulfill its role in sustainable mobility, the potential for optimization that lies in utilizing public transport data should be unlocked. As we showed in this paper, there are several different data sources in public transport and a variety of use cases that are worthwhile to explore. However, the current data formats and policies around data usage can complicate data analysis for public transport. The implementation of unifying standards and the continuous modernization of public transport infrastructure can mitigate this problem. Open Data policies, whether in laws or organizational policies, can help to spark a wide range of research and advance the knowledge about data analysis in public transport as well as raise awareness of the potential that lies in the data. Some projects towards unified data standards, open data policies and consistent data management have already been launched, but these initiatives should be intensified and accelerated. Key to this is to broaden the understanding of the potential of public transport data and to demonstrate the benefits of data analysis. In our opinion, it is highly beneficial to investigate further the visual analytics of public transport data and of data analysis results. High quality visualizations are often complex to develop, but they contribute greatly to the understanding between data scientists and domain experts. We therefore argue that a toolbox of visualization tools, specifically for transport and mobility data, would greatly simplify the implementation of interdisciplinary analysis projects for mobility data.
We are further developing our visualization dashboard presented in Section 4.1 to include additional visualizations and data sources. User tests of the dashboard are planned to improve usability and interaction. Future work on our project about travel demand analysis presented in Section 4.2 involves privacy preserving methods for handling request data that enable us to use position data in user requests in a sensitive way. We hope to build on results from this line of work in several other projects planned at this time. For the analysis of travel demand, we are looking to join our insights from this project and our project on sharing services, described in Section 4.4. We are interested in investigating which methods could expand knowledge on travel demands and behavior when data from several data sources and potentially several mobility services are joined. A worthwhile objective would be the prediction of travel demand for several types of transport modes. For the prediction of passenger numbers in vehicles, as described in Section 4.3, we look forward to resuming our efforts using data from a non-COVID-19 time and comparing our findings for both datasets. Additionally, we plan to use deep learning approaches to explore the problem of predicting passenger numbers from route requests of passenger information systems. The results of our anomaly detection project (Section 4.5) show the necessity to involve domain experts in the interpretation of analysis results. We therefore plan to advance this project further by including these results in our visualization dashboard. We would like to implement an interactive visualization of the clustering results to enable domain experts to classify the detected anomalies further. The labeled data generated in such a step should be a basis for future work on applying supervised learning methods to pursue anomaly identification.
The projects presented in this paper are of an explorative nature, trying to illustrate the potential of public transport data and challenges in this field of research. We hope to encourage a discussion about data management and data analysis in the public transport domain and to share our experiences and findings with the research community to discuss suitable methods and approaches. We intend to pursue such projects further to eventually help to pave the way for data-driven optimization for sustainable public transport.

Author Contributions

Conceptualization, C.K. and F.G.; Project administration, T.S.; Software, F.G. and C.F.G.; Supervision, T.S.; Visualization, F.G. and C.F.G.; Writing—original draft, C.K., F.G. and C.F.G.; Writing—review & editing, C.K. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the German Federal Ministry of Education and Research, grant number 13FH225PX6. The APC was funded by the German Federal Ministry of Education and Research.

Data Availability Statement

Not applicable.

Acknowledgments

This work was conducted within the scope of the research project “VSB-ÖP: Verlässlichkeit von Smart- und Big Data im öffentlichen Personenverkehr” and was funded by the German Federal Ministry of Education and Research (Funding ID: 13FH225PX6). We would like to thank our project partners and workshop participants for their contributions. We also want to thank Tabea Schmidt and Jonas Hansert for their valuable contribution to our work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Karatsoli, M.; Nathanail, E. A Thorough Review of Big Data Sources and Sets Used in Transportation Research. In Reliability and Statistics in Transportation and Communication; Springer: Cham, Switzerland, 2018; pp. 540–550.
  2. Zannat, E.; Khatun, C.F.C. Emerging Big Data Sources for Public Transport REVIEW. J. Indian Inst. Sci. 2019, 99, 601–619.
  3. Welch, T.F.; Widita, A. Big data in public transportation: A review of sources and methods. Transp. Rev. 2019, 39, 795–818.
  4. Bohnenkamp, C. Hannover: Üstra Will Fahrgäste in Bahnen Wiegen. Available online: https://www.neuepresse.de/Hannover/Meine-Stadt/Hannover-Uestra-will-Fahrgaeste-in-Bahnen-wiegen (accessed on 20 March 2021).
  5. Verband Deutscher Verkehrsunternehmen e.V. VDV-Schrift 457 Automatische Fahrgastzählsysteme; Technical Reports; Verband Deutscher Verkehrsunternehmen e.V.: Cologne, Germany, 2018.
  6. Haller, O. Automatic Passenger Counting—An Overview. Available online: https://www.isarsoft.com/blog/automatic-passenger-counting-an-overview/ (accessed on 30 December 2021).
  7. Corazza, M.V.; Vasari, D.; Petracci, E.; Brambilla, L. Predictive Maintenance for Buses: Outcomes and Potential from an Italian Case Study. In Data Analytics: Paving the Way to Sustainable Urban Mobility; Springer: Cham, Switzerland, 2019; pp. 461–468.
  8. Reinhardt, W. Öffentlicher Personennahverkehr; Springer: Cham, Switzerland, 2018.
  9. Chen, C.; Bian, L.; Ma, J. From traces to trajectories: How well can we guess activity locations from mobile phone traces? Transp. Res. Part C Emerg. Technol. 2014, 46, 326–337.
  10. Wang, Z.; He, S.Y.; Leung, Y. Applying mobile phone data to travel behaviour research: A literature review. Travel Behav. Soc. 2018, 11, 141–155.
  11. Nikolaidou, A.; Papaioannou, P. Utilizing Social Media in Transport Planning and Public Transit Quality: Survey of Literature. J. Transp. Eng. Part A Syst. 2018, 144, 4018007.
  12. Petersen, N.C.; Rodrigues, F.; Pereira, F.C. Multi-output bus travel time prediction with convolutional LSTM neural network. Expert Syst. Appl. 2019, 120, 426–435.
  13. Yu, B.; Wang, H.; Shan, W.; Yao, B. Prediction of Bus Travel Time Using Random Forests Based on Near Neighbors. Comput.-Aided Civ. Infrastruct. Eng. 2018, 33, 333–350.
  14. Darwish, A.; Khalil, M.; Badawi, K. Optimising Public Bus Transit Networks Using Deep Reinforcement Learning. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–7.
  15. Holleczek, T.; Yu, L.; Lee, J.K.; Senn, O.; Ratti, C.; Jaillet, P. Detecting Weak Public Transport Connections from Cellphone and Public Transport Data. In Proceedings of the 2014 International Conference on Big Data Science and Computing, Beijing, China, 4–7 August 2014; Association for Computing Machinery: New York, NY, USA, 2014.
  16. Li, H.; Parikh, D.; He, Q.; Qian, B.; Li, Z.; Fang, D.; Hampapur, A. Improving rail network velocity: A machine learning approach to predictive maintenance. Transp. Res. Part C Emerg. Technol. 2014, 45, 17–26.
  17. Falamarzi, A.; Moridpour, S.; Nazem, M. Development of a tram track degradation prediction model based on the acceleration data. Struct. Infrastruct. Eng. 2019, 15, 1308–1318.
  18. Le Nguyen, M.H.; Turgis, F.; Fayemi, P.-E.; Bifet, A. Challenges of Stream Learning for Predictive Maintenance in the Railway Sector. In IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning; Springer: Cham, Switzerland, 2020; pp. 14–29.
  19. Xie, J.; Huang, J.; Zeng, C.; Jiang, S.-H.; Podlich, N. Systematic Literature Review on Data-Driven Models for Predictive Maintenance of Railway Track: Implications in Geotechnical Engineering. Geosciences 2020, 10, 425.
  20. van Oort, N.; Cats, O. Improving Public Transport Decision Making, Planning and Operations by Using Big Data: Cases from Sweden and the Netherlands. In Proceedings of the 2015 IEEE 18th International Conference on Intelligent Transportation Systems, Gran Canaria, Spain, 15–18 September 2015; pp. 19–24.
  21. Albuquerque, V.; Andrade, F.; Ferreira, J.C.; Dias, M.S.; Bacao, F. Bike-sharing mobility patterns: A data-driven analysis for the city of Lisbon. EAI Endorsed Trans. Smart Cities 2021, 5, e7.
  22. Yaghini, M.; Khoshraftar, M.M.; Seyedabadi, M. Railway passenger train delay prediction via neural network model. J. Adv. Transp. 2013, 47, 355–368.
  23. Colpaert, P.; Chua, A.; Verborgh, R.; Mannens, E.; Van de Walle, R.; Vande Moere, A. What Public Transit API Logs Tell Us about Travel Flows. In Proceedings of the 25th International Conference Companion on World Wide Web, Montreal, QC, Canada, 11–15 April 2016; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2016; pp. 873–878.
  24. Corman, F.; Kecman, P. Stochastic prediction of train delays in real-time using Bayesian networks. Transp. Res. Part C Emerg. Technol. 2018, 95, 599–615.
  25. Han, Y.; Francois, O.; Same, A.; Bouillaut, L.; Oukhellou, L.; Aknin, P.; Branger, G. Online predictive diagnosis of electrical train door systems. In Proceedings of the 10th World Congress on Railway Research (WCRR 2013), Milan, Italy, 24–27 November 2013.
  26. Davari, N.; Veloso, B.; Costa, G.d.A.; Pereira, P.M.; Ribeiro, R.P.; Gama, J. A Survey on Data-Driven Predictive Maintenance for the Railway Industry. Sensors 2021, 21, 5739.
  27. Kalathas, I.; Papoutsidakis, M. Predictive Maintenance Using Machine Learning and Data Mining: A Pioneer Method Implemented to Greek Railways. Designs 2021, 5, 5.
  28. Chen, W.; Zhuang, P.; Liang, H. Reinforcement Learning for Smart Charging of Electric Buses in Smart Grid. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 13–19 December 2019; pp. 1–6.
  29. Pamuła, T.; Pamuła, W. Estimation of the Energy Consumption of Battery Electric Buses for Public Transport Networks Using Real-World Data and Deep Learning. Energies 2020, 13, 2340.
  30. Wang, S.; Lu, C.; Liu, C.; Zhou, Y.; Bi, J.; Zhao, X. Understanding the Energy Consumption of Battery Electric Buses in Urban Public Transport Systems. Sustainability 2020, 12, 10007.
  31. Vandewiele, G.; Colpaert, P.; Janssens, O.; Van Herwegen, J.; Verborgh, R.; Mannens, E.; Ongenae, F.; De Turck, F. Predicting Train Occupancies Based on Query Logs and External Data Sources. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, 3–7 April 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1469–1474.
  32. van Roosmalen, J.J. Forecasting Bus Ridership with Trip Planner Usage Data: A Machine Learning Application. Master’s Thesis, University of Twente, Enschede, The Netherlands, March 2019.
  33. Chen, C.; Ma, J.; Susilo, Y.; Liu, Y.; Wang, M. The promises of big data and small data for travel behavior (aka human mobility) analysis. Transp. Res. Part C Emerg. Technol. 2016, 68, 285–299.
  34. Ma, X.; Wu, Y.-J.; Wang, Y.; Chen, F.; Liu, J. Mining smart card data for transit riders’ travel patterns. Transp. Res. Part C Emerg. Technol. 2013, 36, 1–12.
  35. Briand, A.-S.; Côme, E.; Trépanier, M.; Oukhellou, L. Analyzing year-to-year changes in public transport passenger behaviour using smart card data. Transp. Res. Part C Emerg. Technol. 2017, 79, 274–289.
  36. Kieu, L.M.; Bhaskar, A.; Chung, E. Mining temporal and spatial travel regularity for transit planning. In Australasian Transport Research Forum 2013 Proceedings; Australasian Transport Research Forum: Brisbane, Australia, 2013; pp. 1–12.
  37. Morency, C.; Trepanier, M.; Agard, B. Analysing the Variability of Transit Users Behaviour with Smart Card Data. In Proceedings of the 2006 IEEE Intelligent Transportation Systems Conference, Toronto, ON, Canada, 17–20 September 2006; pp. 44–49.
  38. Poussevin, M.; Tonnelier, E.; Baskiotis, N.; Guigue, V.; Gallinari, P. Mining Ticketing Logs for Usage Characterization with Nonnegative Matrix Factorization. In Big Data Analytics in the Social and Ubiquitous Context; Springer: Cham, Switzerland, 2016; pp. 147–164.
  39. El Mahrsi, M.K.; Côme, E.; Oukhellou, L.; Verleysen, M. Clustering Smart Card Data for Urban Mobility Analysis. IEEE Trans. Intell. Transp. Syst. 2017, 18, 712–728.
  40. Toqué, F.; Khouadjia, M.; Come, E.; Trepanier, M.; Oukhellou, L. Short & long term forecasting of multimodal transport passenger flows with machine learning methods. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 560–566.
  41. Han, Y.; Wang, C.; Ren, Y.; Wang, S.; Zheng, H.; Chen, G. Short-Term Prediction of Bus Passenger Flow Based on a Hybrid Optimized LSTM Network. ISPRS Int. J. Geo-Inf. 2019, 8, 366.
  42. Liu, W.; Tan, Q.; Wu, W. Forecast and Early Warning of Regional Bus Passenger Flow Based on Machine Learning. Math. Probl. Eng. 2020, 2020, 6625435.
  43. Lathia, N.; Capra, L. How Smart is Your Smartcard? Measuring Travel Behaviours, Perceptions, and Incentives. In Proceedings of the 13th International Conference on Ubiquitous Computing, Beijing, China, 17–21 September 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 291–300.
  44. Ghahramani, N.; Brakewood, C. Trends in Mobile Transit Information Utilization: An Exploratory Analysis of Transit App in New York City. J. Public Transp. 2016, 19, 139–160.
  45. Ghahramani, N.; Brakewood, C. Requests for Ridehailing During an Extreme Weather Event: Exploratory Analysis of New York City. J. Urban Plan. Dev. 2020, 146, 04020006.
  46. Remy, C.; Brakewood, C.; Ghahramani, N.; Kwak, E.J.; Peters, J. Transit Information Utilization during an Extreme Weather Event: An Analysis of Smartphone App Data. Transp. Res. Rec. 2018, 2672, 90–100.
  47. Wörner, M.; Ertl, T. Visual Analysis of Public Transport Vehicle Movement. In Proceedings of the EuroVA 2012: International Workshop on Visual Analytics, Vienna, Austria, 4–5 June 2012.
  48. Zeng, W.; Fu, C.-W.; Müller Arisona, S.; Erath, A.; Qu, H. Visualizing Waypoints-Constrained Origin-Destination Patterns for Massive Transportation Data. Comput. Graph. Forum 2016, 35, 95–107.
  49. Bagchi, M.; White, P.R. The potential of public transport smart card data. Transp. Policy 2005, 12, 464–474.
  50. Jamar, L.; Büchel, B.; Corman, F. A Network-Wide Approach to Predicting Urban Public Transport Passenger Numbers at a Stop-to-Stop Level; Technical Reports; ETH Zürich: Zürich, Switzerland, 2020.
  51. Zhou, M.; Wang, D.; Li, Q.; Yue, Y.; Tu, W.; Cao, R. Impacts of weather on public transport ridership: Results from mining data from different sources. Transp. Res. Part C Emerg. Technol. 2017, 75, 17–29.
  52. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
  53. Verband Deutscher Verkehrsunternehmen e.V. Die ÖPNV-Bilanz des Corona-Jahres 2020. Available online: https://www.vdv.de/presse.aspx?id=458fc281-0ec8-4de5-a676-ecdad74ee0ad&mode=detail (accessed on 1 February 2022).
  54. Wen, J.; Zhao, J.; Jaillet, P. Rebalancing shared mobility-on-demand systems: A reinforcement learning approach. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 220–225.
  55. Basora, L.; Morio, J.; Mailhot, C. A Trajectory Clustering Framework to Analyse Air Traffic Flows. In Proceedings of the SID 2017, 7th SESAR Innovation Days, Belgrade, Serbia, 28–30 November 2017.
  56. Cai, L.; Li, S.; Wang, S.; Liang, Y. GPS Trajectory Clustering and Visualization Analysis. Ann. Data Sci. 2018, 5, 29–42.
  57. da Silva Ferreira, M.; Vismari, L.F.; Cugnasca, P.S.; de Almeida, J.R.; Camargo, J.B.; Kallemback, G. A Comparative Analysis of Unsupervised Learning Techniques for Anomaly Detection in Railway Systems. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 444–449.
  58. Basora, L.; Olive, X.; Dubot, T. Recent Advances in Anomaly Detection Methods Applied to Aviation. Aerospace 2019, 6, 117.
  59. Besse, P.; Guillouet, B.; Loubes, J.-M.; François, R. Review & Perspective for Distance Based Trajectory Clustering; Technical Reports. arXiv preprint 2015, arXiv:1508.04904.
  60. Bonawitz, K.; Kairouz, P.; McMahan, B.; Ramage, D. Federated Learning and Privacy: Building Privacy-Preserving Systems for Machine Learning and Data Science on Decentralized Data. Queue 2021, 19, 87–114.
  61. Shaham, S.; Ding, M.; Liu, B.; Dang, S.; Lin, Z.; Li, J. Privacy Preserving Location Data Publishing: A Machine Learning Approach. IEEE Trans. Knowl. Data Eng. 2021, 33, 3270–3283.
Figure 1. Categorization of data sources in public transport, examples for specific variants of these systems in blue.
Figure 2. Difference between the route request and route response datasets. The upper part shows the request with start and destination information. The lower part shows the response. In this example, the response consists of one trip option with two legs.
Figure 3. The spatial distribution of requests before (above) and after (below) our preprocessing.
Figure 4. An interactive graph shows the relative frequency of daily requests per weekday.
Figure 5. The dashboard application operated on a display wall using eight separate displays.
Figure 6. A screenshot of the dashboard application.
Figure 7. A section of the Sankey diagram showing the frequency of requests between the stop “Ettlingen Albgaubad” and various destinations.
Figure 8. Example of a Heatmap in the Region of Munich, showing poorly served requests.
Figure 9. Example Time Wheel, showing request counts of night hours on weekends.
Figure 10. Example network spider showing requests from and to an example cluster point.
Figure 11. Prediction accuracy curve for the KNN algorithm.
Figure 12. The results of clustering 1000 journeys of one line. Cluster 1 contains 437 trips and Cluster 0 contains 430 trips. These are two clusters of regular trips. One cluster contains all trips starting at the easternmost stop, located on the right, while the other cluster contains all trips ending at this stop.
Figure 13. The results of clustering 1000 journeys of one line. This is the cluster containing all anomalies of journeys on this route. They are color coded.
Table 1. Categorization of use cases for data analysis in public transport.
Public Transport Use Cases Using Data Science Methods
Tasks Concerning… | Current | Short-Term Future | Medium-Term Future | Long-Term Future
… the public transport network | determine current public transport situation | evaluate network performance
duty scheduling
predictive maintenance for infrastructure
planning of lines | network planning
multimodal planning
planning of infrastructure
… the timetable and public transport service | early detection of delays
early detection of disruptions
support of (re-)scheduling
organizing replacement services
manage on-demand services
evaluate connections | planning of frequencies and connections | planning of on-demand services
… public transport vehicles | detection of vehicle malfunctions | planning of charging schedules for electric buses
predictive maintenance for vehicles
vehicle capacity planning
… public transport passengers | determining passenger numbers
directing passenger flows
predicting traffic behavior
secure connections
predicting/analyzing public transport demand
predicting/analyzing public transport demand | predicting/analyzing public transport demand
… public transport passenger information | providing real-time information: in standard situations, in case of delays, in case of disruptions
providing personalized information
providing critical information ahead of time (about upcoming trips or connections)
Table 2. Our projects applying data science methods to public transport data.
Use Case | Datasets | Methods
1. Visual Analytics for Public Transport | data from automated passenger information systems | interactive visualizations
2. Analyzing Demand for On-Demand Planning | data from automated passenger information systems | clustering and visualizations
3. Predicting Passenger Numbers in Vehicles | data from automated passenger information systems and automated passenger count systems | decision trees
4. Analyzing Usage of Bike- and E-Scooter Sharing | trip data from a bike sharing provider and an e-scooter sharing provider | clustering and deep learning
5. Detecting Anomalies in Vehicle Data | log files of public transport vehicles | spatial clustering
Table 3. The data in a route request for reference.
Data in a Route Request
timestamp | timestamp of the request
date and time | date and time for which the route is to be calculated
arrival or departure | whether this date and time should be used as arrival or departure time
type of point | whether the point is a stop, address, coordinate, street or other | information about origin, destination and, if specified, waypoints in between
name of point | the name of the point
municipality code | the municipality code of the point
coordinates | the coordinates of the point
stop id | the stop id, if the point is a stop
user agent | type of user agent (browser, app, …)
application | which application was used to perform the request
optimization method | which optimization method should be used for route calculation, e.g., quickest connection, cheapest connection, least changes
walking speed | the average walking speed of the user
means of transport | the means of transport available to the user at the origin and destination stops
accessibility | the accessibility needs the user indicated
Table 4. Results of the machine learning models. RF = Random Forest; GBT = Gradient Boosted Trees.
Data set | RF: RMSE | RF: Ridership Accuracy | GBT: RMSE | GBT: Ridership Accuracy
API data set | 2.77 | 91.37% | 3.00 | 72.27%
Control data set | 2.99 | 77.37% | 3.28 | 69.99%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Keller, C.; Glück, F.; Gerlach, C.F.; Schlegel, T. Investigating the Potential of Data Science Methods for Sustainable Public Transport. Sustainability 2022, 14, 4211. https://doi.org/10.3390/su14074211
