1. Introduction
The term “smart city” was defined by IBM [1] as a city that uses information and communication technology to analyze data and integrate it into the core systems that run the city. The key enabler of a smart city is its connected devices and the way the data generated by Internet of Things (IoT) sensors [2] is used. As the volume and variety of IoT data keep increasing exponentially, using that data and transforming it into knowledge for a smart city are crucial tasks for modern civilization.
A large amount of data collected from speed sensors and surveillance camera systems has been used to monitor road traffic conditions in the intelligent transportation system (ITS) domain. The most common detection technologies are loop detectors, roadside cameras, and on-board equipment [3]. Lv et al. used California’s freeway traffic detector station data to predict traffic flow [4]. Zhao, Chen also used data from Beijing’s ring road observation station, which is equipped with cameras, induction coils, and velocity radars, to make short-term forecasts of traffic volumes [5]. Similarly, closed-circuit television (CCTV) camera, laser detector, and loop detector data were used by Kim and Hong to predict traffic flow on intersection roads in Suwon city, South Korea [6]. All of these works relied on data from fixed traffic detectors, which are costly transportation infrastructure. Therefore, how to obtain less costly and more accurate traffic flow information is an important issue in this research area.
Limited budgets for ITS implementation in developing countries, where the transportation infrastructure is relatively underdeveloped, hinder the detection of traffic conditions. Fortunately, the telecommunication infrastructure in these developing countries is comparatively widespread. Smartphones, in fact, provide a sufficient source of traffic data through their global positioning system (GPS) traces. For example, the government of Jakarta, the capital city of Indonesia, has collaborated with Waze® [7], a popular navigation application installed on citizens’ smartphones, to aggregate the data feed from those smartphones for traffic purposes. Instead of coming from static positions, the location information is dynamic and follows each smartphone’s location. Because the volume of signals from smartphones is huge, data preprocessing is needed to derive traffic jam indicators.
Traffic prediction has been developed to address the challenges and concerns summarized in [8]. For an urban road network, the main challenges noted in [8] are: (1) a more complex, large-scale road network must be considered because of the urban environment; (2) the model should include the spatial-temporal (ST) characteristics of the traffic network to describe the corresponding conditions among multiple roads; and (3) the use of artificial intelligence is still an emerging research area. How to develop a deep learning framework specifically for this area is therefore an important research topic.
Understanding, monitoring, and predicting road traffic conditions, such as the level of traffic jam at certain times, are essential needs of urban transportation [9]. Over the past decade, many studies have been conducted on traffic prediction models. According to the reviews [8,10,11], most existing models employ statistical, time series based, probabilistic, or neural network approaches. Three families of models are summarized below: (1) time series analysis models, (2) traditional machine learning models, and (3) deep learning based models.
(1) Time Series Analysis Model
Time series analysis approaches such as ARIMA [12] and sliding window ARIMA (SWARIMA) [9] have been used to study traffic conditions for prediction purposes. Traffic flow data recorded from sensors are frequently noisy, and short-term traffic is considered nonstationary. To deal with this issue, Xie et al. applied a Kalman filter (KL) to handle nonstationarity in short-term prediction [13]. In their work, wavelet decomposition analysis was used to reduce the noise. Their model was applied to a relatively simple freeway road network. Yan et al. also proposed a traffic flow prediction method based on multivariate time series to understand traffic patterns on the freeway [14]. In their work, multivariate traffic time series of volume, occupancy, and speed were converted into a complex network structure.
Most of these approaches focus on (1) predicting traffic flow at a single data point, (2) treating traffic data as a sequence, and (3) finding patterns in the temporal variation of traffic on one road segment. However, road traffic conditions can propagate through the road network, as mentioned in [15]. If one road is badly jammed, the congestion spreads to nearby roads, which causes a network effect. Because of this propagation, road network wide (RNW) traffic prediction, which predicts the traffic conditions of multiple roads at once, is needed; it should rely not only on time series analysis but also on the network structure of the roads.
(2) Traditional Machine Learning Model
In addition to traditional time series analysis, machine learning models such as the support vector machine (SVM), neural network (NN), and deep learning (DL) network have been applied to the traffic prediction domain for decades. Tang et al. proposed a fuzzy neural network model to predict traffic speed by considering periodic characteristics [16]. In their work, k-means was employed to extract periodic features from travel speed data collected at three adjacent stations. Then, a trigonometric regression was used to predict travel speed multiple steps ahead. Although the proposed model performed well on single-road prediction, it did not consider the road network perspective discussed above.
SVM-based models have been widely employed for traffic prediction. For example, Zeng et al. proposed AOSVR to improve the time efficiency of traffic flow prediction [17], and Saldana-Perez et al. took advantage of social media data to characterize traffic congestion and analyze crowd-sensed data from a geospatial perspective [18]. Yan and Yu proposed an improved SVM for classifying traffic jam conditions that can handle the negative effect of outliers in the traffic data [19]. Their study used a relatively small dataset, and therefore their model did not address large-scale traffic prediction.
Furthermore, Tang et al. proposed a hybrid model combining SVM with several denoising techniques to predict traffic volume multiple steps ahead (2 min, 10 min, and 60 min time horizons) [20]. Empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD), moving average (MA), Butterworth (BW) filter, and wavelet (WL) were combined with SVM. In their experiments, EEMD outperformed the other denoising techniques. Although the proposed model performed well in predicting traffic conditions multiple steps ahead, the spatial correlation between detectors was not considered in their study. Moreover, the predictive model focused only on the freeway and not on the urban environment.
(3) Deep Learning Based Model
Recently, more advanced approaches such as deep learning models have been applied to traffic prediction because of their promising performance. For example, Kim et al. used a deeper and more complex recurrent neural network (RNN) model to predict traffic speed [21]. Abbas et al. used a long short-term memory (LSTM) model, a type of RNN, for short-term road traffic prediction [22]. LSTM has also been used for travel time prediction [23]. RNN and LSTM models are state-of-the-art and powerful for capturing temporal features in traffic. However, these studies do not consider spatial interaction from the road network point of view, and the forecasting task is limited to a single road in a relatively small region.
Capturing the local dependencies of a network can improve the prediction. Song et al. predicted weekday traffic speed on Seoul’s main roads using a convolutional neural network (CNN) and obtained better results than two multilayer perceptron networks [24]. Another study took advantage of both LSTM and CNN to capture spatial-temporal correlation for travel time prediction [25]. Nevertheless, few studies in the literature have considered the RNW as a whole to exploit the correlation of spatial-temporal features effectively and estimate the traffic interactions among road segments on a large scale. For urban roads especially, knowing the traffic condition of the whole road network, instead of a single road, can lead to better decision making by transporters.
To address this concern, Ma et al. proposed a novel approach that represents traffic data as an image and applies a CNN to predict traffic speed network wide instead of on a single road segment [26]. The model was evaluated on Beijing’s ring road network using GPS data from taxis across the city and achieved a 42% accuracy improvement compared with other algorithms such as KNN, ANN, random forest, and least squares methods. In addition, the studies in [27,28] compared CNN-based models with a traditional ANN. Their findings confirmed that deep learning based models outperform the traditional ANN, other machine learning methods, and statistical methods.
As evident in the literature, CNN has a drawback in its pooling operation, which was addressed in Sabour et al.’s work [29]. To tackle this limitation, Sabour et al. introduced a new type of neural network called the capsule network (CapsNet). A capsule is a group of neurons that represent different properties of the same entity. CapsNet is trained with dynamic routing instead of max-pooling and uses a different nonlinearity, namely squash. Since then, many works have applied CapsNet to train prediction models, including for traffic prediction problems. Extending the work of Ma et al., Kim et al. applied CapsNet to the traffic speed prediction problem [30]. In their work, the max-pooling operator inside the CNN was found to lose information about the interaction among road characteristics in urban transportation. Therefore, replacing the max-pooling operator with the routing-by-agreement algorithm of CapsNet improved the prediction result.
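For reference, the squash nonlinearity introduced by Sabour et al. [29] rescales a capsule's vector so that its length falls between 0 and 1 while its direction is preserved. The following is a minimal pure-Python sketch of that function only; the name `squash` and the small `eps` guard against division by zero are assumptions made for illustration and are not taken from the implementation used in this paper.

```python
import math


def squash(s, eps=1e-9):
    """Squash a capsule vector s (list of floats).

    v = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    The output keeps the direction of s; its length lies in [0, 1).
    eps is an illustrative guard (assumption) for the zero vector.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


if __name__ == "__main__":
    v = squash([3.0, 4.0])
    # The input has length 5, so the output length is 25/26.
    print(v, math.hypot(*v))
```

Long vectors are mapped to a length close to 1 and short vectors to a length close to 0, so the length can be read as the probability that the entity represented by the capsule is present.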
In this study, we develop a model for predicting traffic jam speed on urban roads based on information collected from citizens’ smartphones. To differentiate the urban roads on which this work focuses from free-flowing, well-regulated road transportation, we introduce the term “urban swarming transportation” (UST). In UST, lane markings on most roads are unclear or absent, so all kinds of vehicles, pedestrians, and even animals share the same road. The traffic light system on UST roads in some developing countries is insufficient or not strictly followed by transporters.
Figure 1 shows a typical example of the UST condition (the photo was taken in west Jakarta). As shown in Figure 1, the UST roads are swarmed with all the kinds of transporters mentioned above. Note that the road in the picture is not one-way; all transporters use this narrow, bidirectional road.
To handle the traffic data collected from mobile phones, data preprocessing is needed to map the traffic speed data, originally referenced by longitude and latitude, to traffic measurements on road sections. In this study, two deep learning methods, CNN and its extension CapsNet, are used to train models that predict traffic jam speed on UST roads. The CNN model serves as the benchmark for comparison. In addition, an optimized CapsNet architecture (named OCapsNet) is proposed that changes the ReLU nonlinearity in the first two layers of the convolution step. Edgar Squash is applied to the capsule layer of the original CapsNet. We also propose strategies to tweak dynamic routing, as mentioned in [31,32,33], and thereby obtain better prediction performance.
The contributions of this paper are summarized as follows:
We used traffic data recorded by mobile sensors such as GPS, instead of fixed detectors on the road, as a cost-efficient source for traffic prediction on urban roads under UST;
We proposed CapsNet-based traffic jam prediction as a counterpart to the CNN-based predictive model to deal with the RNW condition, that is, a complex road network with spatial-temporal traffic characteristics under UST;
We improved the performance of CapsNet by changing the nonlinearity function in its convolution layers and modifying dynamic routing between the two capsule layers of the original CapsNet.
This paper is organized as follows: Section 2 addresses the techniques used for traffic data preprocessing. Section 3 describes the details of the traffic jam speed prediction tools. Section 4 outlines the experimental setup of the deep learning methods and the experimental results. Finally, the conclusion and future study directions are given in Section 5.
2. Traffic Data Preprocessing
In this research, one year of traffic jam speed data on urban roads in the Jakarta metropolitan area is used as sample UST data. The data were collected by Jakarta Smart City (JSC), the government’s independent smart city division of Jakarta. Traffic speed in any Jakarta area is captured at one-second intervals from the Waze mobile app. The traffic jam records are aggregated at 15 min and 5 min intervals. According to JSC, traffic speeds higher than 10 km/h can be considered free flow (traffic in Jakarta is extremely congested). Therefore, the raw dataset keeps only the traffic jam records. Each record comes with a longitude and latitude position and a traffic jam speed (lower than 10 km/h) that is associated with one of five traffic jam levels. Table 1 shows examples of traffic jam speed data. The traffic speed associated with each traffic jam level is categorized as follows:
Level 0 is interpreted as free flow (not recorded);
Level 1 is 6.1 km/h to 8.1 km/h of traffic speed;
Level 2 is 4.1 km/h to 6.1 km/h of traffic speed;
Level 3 is 2.1 km/h to 4.1 km/h of traffic speed;
Level 4 is greater than 0.0 km/h and up to 2.1 km/h of traffic speed;
Level 5 is 0.0 km/h which is denoted as blocked.
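The level definitions above can be expressed as a small lookup function. The sketch below is illustrative only: the function name `jam_level` is hypothetical, and because the listed ranges share their end points (2.1, 4.1, and 6.1 km/h), it assumes that each shared boundary belongs to the slower level, by analogy with Level 4. The source does not assign a level to speeds above 8.1 km/h and up to 10 km/h, so the sketch returns `None` there instead of guessing.

```python
def jam_level(speed_kmh):
    """Map a traffic speed in km/h to a traffic jam level (0 to 5).

    Assumptions (not stated in the source): shared boundaries belong
    to the slower level, and the range (8.1, 10] km/h has no level.
    """
    if speed_kmh < 0:
        raise ValueError("speed must be non-negative")
    if speed_kmh == 0.0:
        return 5  # blocked
    if speed_kmh <= 2.1:
        return 4
    if speed_kmh <= 4.1:
        return 3
    if speed_kmh <= 6.1:
        return 2
    if speed_kmh <= 8.1:
        return 1
    if speed_kmh > 10.0:
        return 0  # free flow, not recorded in the dataset
    return None  # level not specified in the source for this range


if __name__ == "__main__":
    for speed in (0.0, 1.5, 3.0, 5.0, 7.0, 9.0, 12.0):
        print(speed, jam_level(speed))
```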
Four relevant attributes of the urban traffic jam data were chosen in this research: time of occurrence, traffic speed, longitude, and latitude of the traffic jam, as shown in Table 1. To identify the traffic jam location on a specific urban road, external road information from a public road database, OpenStreetMap (OSM), is used [34].
The process of integrating the traffic jam dataset (Table 1) with OSM is shown in Figure 2. First, the coordinates (latitude and longitude) and the traffic jam measure are extracted from the dataset as a data point. Second, the 10 road segments nearest to the data point (or more, depending on the setting) are queried from OSM, and the starting and ending coordinates of each segment are connected as a line. Third, assuming that each road is centered on its line and has a certain width, we check whether the point lies on the road. If it does, the traffic measure of the point is associated with that road segment. If not, the check continues with the other road segments. If the point lies on none of the retrieved road segments, its coordinates are too far from any road and the point is discarded as useless.
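The matching step described above can be sketched as a point-to-segment distance test. This is a minimal illustration, not the authors' implementation: the function names, the 10 m half-width, and the local equirectangular projection to meters are all assumptions, and the candidate segments are assumed to have been retrieved from OSM beforehand and ordered by proximity.

```python
import math

EARTH_RADIUS_M = 6371000.0


def to_local_xy(lat, lon, ref_lat, ref_lon):
    """Project (lat, lon) to meters around a reference point.

    Uses an equirectangular approximation, adequate for short distances.
    """
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_M
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    return x, y


def distance_to_segment_m(point, start, end):
    """Distance in meters from point to the line segment start-end.

    All arguments are (lat, lon) tuples; the point is used as the origin.
    """
    ax, ay = to_local_xy(start[0], start[1], point[0], point[1])
    bx, by = to_local_xy(end[0], end[1], point[0], point[1])
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(ax, ay)
    # Position of the closest point along the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len_sq))
    return math.hypot(ax + t * dx, ay + t * dy)


def match_point(point, candidates, half_width_m=10.0):
    """Return the ID of the first candidate segment the point lies on.

    candidates: list of (road_id, start, end) tuples.
    Returns None when the point is too far from every candidate.
    """
    for road_id, start, end in candidates:
        if distance_to_segment_m(point, start, end) <= half_width_m:
            return road_id
    return None


if __name__ == "__main__":
    # Hypothetical segment and points near central Jakarta.
    segments = [("seg-1", (-6.2000, 106.8000), (-6.2000, 106.8010))]
    print(match_point((-6.20005, 106.8005), segments))  # on the road
    print(match_point((-6.2010, 106.8005), segments))   # too far away
```

A point that returns `None` corresponds to the "useless point" case in the text and would be dropped from the integrated dataset.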
Every road section in the OSM database is identified by an OSM ID. A single named road may have multiple OSM IDs representing its road sections if the road is very long. Every road section can be broken into smaller road segments, each identified by a Road ID. With the OSM ID and Road ID added, each traffic jam record’s coordinates (longitude and latitude) are located on the nearby road segment. Table 2 shows an example of the traffic data integrated with OSM.
To focus the research problem, eight main roads in the JSC dataset were chosen, as listed in Table 3. These roads, located in central Jakarta, represent typical Jakarta urban roads with extremely congested traffic. Multiple kinds of vehicles and pedestrians are allowed on the chosen roads, which makes them typical of the UST road conditions found in most developing countries. Examples of road sections (OSM ID) and the associated road segments (Road ID) are shown in Table 4.
In this study, a prediction model was developed to predict traffic flow on the selected road segments, which present typical UST road conditions, as shown in Figure 3, while information from supplementary road segments is used as additional input to the prediction. These supplementary roads were chosen for their adjacency to the target roads and because they represent main roads that are spatially correlated with each other. In total, 2,131,584 rows of fifteen-minute interval data and 6,401,890 rows of five-minute interval data were used. From these interval data, two datasets were prepared for the experiments. The smaller dataset contains 61 distinct road segments (Road ID) representing the eight selected roads, as shown in Figure 3a. The larger dataset contains 2972 road segments in a 5 × 5 km coverage area, as shown in Figure 3b. Examples of the data features used in this work are shown in Table 5.
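The 5 min and 15 min aggregation mentioned earlier in this section can be illustrated with a short sketch. The source does not state how records within an interval are combined, so averaging the speeds per road segment is an assumption here, as are the function name `aggregate_speed` and the record layout `(timestamp, road_id, speed_kmh)`. The flooring logic only works for interval lengths that divide 60 evenly, such as 5 and 15.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_speed(records, interval_min=15):
    """Average speed per (road_id, interval start).

    records: iterable of (timestamp, road_id, speed_kmh) tuples,
    where timestamp is a datetime. Averaging is an assumption.
    """
    buckets = defaultdict(list)
    for ts, road_id, speed in records:
        # Floor the timestamp to the start of its interval.
        floored = ts - timedelta(
            minutes=ts.minute % interval_min,
            seconds=ts.second,
            microseconds=ts.microsecond,
        )
        buckets[(road_id, floored)].append(speed)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


if __name__ == "__main__":
    # Hypothetical records for one road segment.
    sample = [
        (datetime(2018, 1, 1, 8, 2, 10), "seg-1", 4.0),
        (datetime(2018, 1, 1, 8, 14, 59), "seg-1", 6.0),
        (datetime(2018, 1, 1, 8, 15, 0), "seg-1", 2.0),
    ]
    for key, mean_speed in sorted(aggregate_speed(sample, 15).items()):
        print(key, mean_speed)
```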