1. Introduction
Near-future predictions of traffic conditions across an arterial road network have been a fundamental part of intelligent transport system (ITS) technology for several decades. Analyzing and predicting traffic conditions in real time can effectively support urban road traffic management, reducing road disruptions and delays, providing congestion warnings, and supporting resource allocation for a safe and sustainable urban infrastructure. Nowadays, ITSs operate in conjunction with the Internet of Things (IoT) [
1] and big data analytics [
2] for effective urban traffic management, highlighting two main aspects: (a) The traffic flow or volume analysis and prediction approaches applied; and (b) the traffic sensor infrastructure installed and used. In parallel with those two aspects, computational capacity has to be considered when integrating models and sensors into real-time applications and automated IoT sensing systems.
Traffic flow analysis and prediction have been a major area of interest within the field of ITSs since the late 1970s [
3]. Typically, the number of vehicles constitutes one of the main parameters used to analyze urban traffic behavior; it is indicative of traffic volume, the term used hereafter. Additional parameters include the type, height and other characteristics of a vehicle. Many approaches have been developed to extract traffic volume and other parameters from various sensors. The faster region-based convolutional neural network (Faster R-CNN; [
4]) has been a well-established deep learning approach used for vehicle detection and classification from images of camera sensors set up at road intersections or from aerial platforms. The study in [
5] investigated Faster R-CNN performance in vehicle detection by optimizing parameters for model fine-tuning. To run their experiments, the authors used the open-source benchmark image datasets developed by the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI; [
6]). Another benchmark image dataset is Common Objects in Context (COCO; [
7]) introduced by Microsoft. Both datasets contain images of the natural and built environment from various regions and are used to train deep learning models. As a result, publicly available pre-trained frozen inference graphs have been released (e.g., in [
8]). These pre-trained models have supported research on vehicle count generation and vehicle classification. However, to accurately quantify the traffic conditions across a local arterial network, datasets obtained from the network’s infrastructure are essential.
Regarding traffic prediction approaches, neural network-based methods have been the most popular ones according to the review in [
9]. For instance, the study in [
10] developed a long short-term memory (LSTM) model, a type of advanced recurrent neural network (RNN), to predict vehicle speeds on expressways using data from roadside loop detectors. More recently, the authors in [
11] combined multiple LSTM models with k-nearest neighbor (KNN), a traditional machine learning (ML) approach, to predict traffic using data from nearby loop detectors with high spatiotemporal correlations. Additionally, the authors of [
12] developed a stacked auto-encoder model comprising multiple neural network layers to predict traffic on freeways. In all these studies, the authors reported that their proposed architectures outperform simpler approaches, with average accuracy improvements ranging from approximately 3% to 20%.
It is notable that the aforementioned studies [
10,
11,
12] demonstrate exceptional examples of innovative model development. However, it is not always guaranteed that advanced deep learning models can successfully be applied to any type of traffic data. According to [
13], simple architectures can sometimes work more efficiently than complex advanced methods. The latter usually demand a series of “trial and error” tests to tune their parameterization, increasing their life cycle cost [
13]. The choice of prediction model is strongly dependent on the type of prediction problem and the characteristics of the traffic data used as input [
13]. Moreover, compared to traditional ML methods, deep learning models are not easily interpretable [14]; hence, expert knowledge is often required [
15].
In terms of conventional traffic sensor infrastructure, numerous studies have extensively used in-ground or roadside inductive loop detectors for traffic prediction [
2,
12,
16,
17]. Other popular traffic sensors include the global navigation satellite systems (GNSSs) embedded in smartphones [
18] or those installed in taxis [
19]. However, such sensor infrastructure requires dedicated installation and can be relatively costly for traffic management bureaus when deployed at multiple locations across an entire city or in hundreds of taxis. An alternative low-cost [
2] and widely available sensor infrastructure can be closed-circuit television (CCTV) systems which have been primarily employed for traffic surveillance, vehicle detection (e.g., automatic number plate recognition (ANPR) systems; [
20]) and tracking [
21] as well as event recognition applications [
22,
23,
24], but not explicitly used for traffic prediction. Compared to studies using loop detectors and GNSS sensors, relatively little research on prediction has been conducted with CCTV datasets in recent years (e.g., in [
25]). In addition, most published research has focused on highways or freeways, whilst urban traffic prediction is yet to be fully investigated: only one-third of published work addresses urban arterials, as recently reported in [
9]. To facilitate such research in urban environments, CCTV datasets have recently become freely available from many local authorities in the UK through initiatives such as the Urban Observatory (UO) project in the North East of England, hosted by Newcastle University [
26].
With the emergence of deep learning technology, a considerable body of literature has grown up, primarily around the development of novel individual architectures. On the one hand, this, together with the freely available benchmark datasets (e.g., COCO) and raw sensor observations (e.g., CCTV image series), has led to the growth of numerous open-source libraries consisting of state-of-the-art object detection and time series prediction approaches (e.g., in [
8,
27]). On the other hand, as also discussed in a very recent study in [
28], applying deep learning approaches for real-time prediction requires high computational capacity to train and update models whenever new real-time traffic information is retrieved. Whilst there is a plethora of advanced approaches, a seamless practical workflow for both traffic flow detection and prediction with minimal computational cost is yet to be developed. Moreover, combined traffic detection and prediction from raw IoT sensing data would significantly benefit traffic monitoring and management, especially when used on integrated platforms where raw data, detections and predictions can be explored and visualized.
To that end, the presented research aims to develop an end-to-end automated CCTV-based traffic volume analysis and prediction framework that is computationally fast and effective enough to be potentially used for near real-time applications. The main motivation of the research is to take advantage of commonly available raw IoT CCTV imagery alongside advanced algorithms within an integrated pipeline (hence the term “end-to-end”) to provide a twofold outcome: (a) Quantification of urban traffic and (b) estimation of future traffic conditions. This framework is intended to support the decision-making process in a local traffic bureau for proactive actions under disruptive circumstances. Specifically, the framework incorporates state-of-the-art CNNs to generate vehicle counts from CCTV image series, quantifying the arterial traffic volume conditions of the North East region, UK. It then utilizes free and open-source libraries for three models (i.e., one statistical model, one machine learning model and one deep learning model) to predict traffic volume at multiple locations across the North East. Tests assess the three prediction models at six locations, with different lengths of historical vehicle counts and with the incorporation of calendar attributes as well as spatio-temporal information from other nearby CCTV cameras. Additionally, a use case of the framework is demonstrated over a six-day period to fill gaps when data are missing from the CCTV image series. The possibility of integrating the framework with an online demonstrator is also explored.
The main contributions of the study are as follows:
To demonstrate the use of raw CCTV images for traffic prediction in complex urban areas within a full end-to-end framework;
To provide constantly updated traffic volume (i.e., vehicle counts) as an open-source dataset to the general public, traffic managers and the research community;
To develop an efficient traffic detection and prediction framework with the potential for near real-time implementation, such as integrating into a live online platform.
The remainder of the paper is organized as follows: Section 2 describes related work. Section 3 presents the methodology, including the data used, the developed framework and the experiments conducted. Section 4 demonstrates the results of traffic prediction per experiment. Section 5 discusses the results and the future directions of the developed framework, and Section 6 concludes the main findings of the work.
3. Methodology
3.1. Data Description
Tests were conducted at six CCTV locations in a part of the North East urban arterial road network in Newcastle upon Tyne and Gateshead, UK. CCTV raw datasets were retrieved from North East Combined Authority (NECA) Travel and Transport Data [
53], UK. These datasets are freely available and, together with the CCTV locations, can be retrieved from the Urban Observatory (UO; [
54]) API. The chosen CCTV cameras are located on roads with different classifications, such as A roads (e.g., A1058) and B roads (e.g., B1305), with various traffic conditions and volumes.
As CCTV cameras are set up by NECA operators to automatically switch views every few minutes, an image is captured from one of a number of different directions every time the camera turns. The number of views per CCTV differs. In addition, every CCTV has a set of specific views, and the vehicle counts of a single camera do not necessarily follow an identical pattern in all views (e.g., a view of city center surface parking yields high counts in off-peak hours). It should be noted that the CCTV cameras are categorized per location and not per view on the NECA and UO web platforms. Due to this particular setup, the estimated vehicle counts have an irregular time interval and can often include gaps.
3.2. Workflow and Experiments
The overarching end-to-end framework of the CCTV-based traffic volume analysis and prediction is schematically presented in
Figure 1. CCTV-based vehicle counts detected with a fine-tuned Faster R-CNN constitute the current state of traffic volume. These generated vehicle counts then feed the prediction analysis, which can support decision making before, during and after a disruptive event via an online platform.
The methodological workflow of the framework is broken down into five stages, as follows: (1) Fine-tuning of the pre-trained Faster R-CNN; (2) estimation of vehicle counts and post-training evaluation; (3) normalization of traffic volume data; (4) model parameterization, training and prediction; and (5) implementation. The first two stages refer to vehicle count generation from CCTV images, which characterizes the traffic volume. Stage 3 prepares the traffic datasets for the model training and prediction of stage 4. The final stage of implementation demonstrates a use case of the workflow over a six-day period. It also explores integrating the fine-tuned models for traffic volume detection and prediction on web-based platforms (e.g., the Flood-PREPARED architecture [
55]). Note that the explanation of the integration is beyond the scope of the presented work and will be covered in a future study.
Regarding the first two stages, preliminary investigations were carried out in [
30] to identify the optimal deep learning neural network for vehicle detection in CCTV images with respect to precision and recall. The results showed that the fine-tuned Faster R-CNN model provided a better harmonic mean (F) (80%) than the fine-tuned SSD MobileNet, but with a relatively unsatisfactory recall (69%). Therefore, additional tests involving the development of a frozen inference graph model with improved vehicle detection performance are carried out in this study.
In stage 4, three prediction approaches from the parametric and nonparametric categories were tested, namely SARIMAX, RF and LSTM. These were assessed under two different scenarios: firstly, using one-month (from midnight 1 July to midnight 2 August 2019) and four-month (from midnight 1 April to midnight 2 August 2019) periods of training/validation datasets; and secondly, including spatio-temporal time series from neighboring CCTV sensors via an OD matrix. To keep consistency across all experiments for direct comparison, 10% of the experiment dataset served as a validation dataset. After identifying the optimal settings per prediction model, the models were re-trained using the entire dataset without omitting the 10%. In all experiments, final predictions were evaluated with respect to detected vehicle counts on 2 August 2019 from 06.30 to 19.00, which constitute the “ground truth” test dataset. A final prediction assessment was also conducted over the period 3–9 August 2019 and is included in stage 5 of the methodological workflow.
3.3. CCTV-Based Traffic Volume (Stages 1 and 2)
Additional tests, using the same computing platform as that described in [30], assessed the performance of two model types in terms of precision, recall and F, as reported in
Table 1. More information about the calculation of the aforementioned metrics can be found in [
30]. The Faster R-CNN ResNet 101 and the Faster R-CNN Inception V2 are the two model types tested. Pre-trained frozen inference graphs of those types with COCO datasets were retrieved from [
8]. The Faster R-CNN architecture consists of convolution layers through which a 600 × 1024 image is fed, followed by pooling, fully connected and softmax layers. Intersection over union (IoU) thresholds of 0.7 and 0.6 were used in the first and second stages of model post-processing, respectively. The maximum number of region proposals was set at 300, the localization loss weight was equal to 2 and the classification loss weight was set to unity. Most parameters were kept at their defaults, based on the configurations of the pre-trained models in [
8]. A learning rate of 0.0002 and a momentum optimizer with a momentum of 0.9 were used during model development, with the different numbers of epochs per model indicated in
Table 1. To fine-tune the pre-trained models, a total of 1269 NECA CCTV images from the year 2018 were used. Fifty NECA CCTV images which were not included in the training dataset were used as ground truth.
Any type of vehicle was manually labeled as a single class by two operators; the first one labeled 569 images and the second operator 700 images. After conducting six tests, as seen in
Table 1, it was found that mixing the training images from the two operators increased the number of false negatives, yielding low recall and harmonic mean values. This may have been caused by unlabeled vehicles in the set of 700 images. The Faster R-CNN Inception V2 model provided the highest number of identified vehicles and the best recall and harmonic mean among all tests.
Figure 2 shows the vehicle detection results of the NC_A1058C1 CCTV camera from two different views.
Figure 2a illustrates that nine vehicles were identified, with 19 missed because they were relatively small in size in the background of the image; droplets on the CCTV sensor also deteriorated the image quality. In
Figure 2b, the fine-tuned model identified nine out of 10 vehicles from a second view of the NC_A1058C1 camera approximately 40 min later, when the rain had stopped. It should be noted that the traffic volume estimated here incorporates vehicle counts from different lanes and directions, as the CCTV cameras alternate between views (
Figure 2). Moreover, the fine-tuned model required circa 0.4 s per image to detect and record the number of vehicles as well as export a single image, as in
Figure 2. For that, an i5-6500
[email protected] Ubuntu 16.04.5 LTS with a graphics processing unit (GPU) Quadro P4000 was used. The vehicle count time series generated with the Faster R-CNN fine-tuned model served as training, validation and “ground truth” test datasets in the experiments, presented here in stages 4 and 5.
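For illustration, the detection step can be sketched as below, assuming a frozen inference graph exported in the TensorFlow 1.x Object Detection API format; the tensor names follow that API’s convention, while the graph path and the 0.5 score threshold are illustrative rather than the exact values used in this study:

```python
import numpy as np
import tensorflow.compat.v1 as tf

# Load the fine-tuned frozen inference graph (path is illustrative).
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

def count_vehicles(image, score_threshold=0.5):
    """Return the number of detections above the confidence threshold."""
    with tf.Session(graph=detection_graph) as sess:
        scores = sess.run(
            detection_graph.get_tensor_by_name("detection_scores:0"),
            feed_dict={
                detection_graph.get_tensor_by_name("image_tensor:0"):
                    np.expand_dims(image, axis=0)  # batch of one image
            })
    return int((scores[0] > score_threshold).sum())
```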
3.4. Data Normalization (Stage 3)
Prior to any model training and prediction, data normalization (stage 3) is implemented, as follows:
(1) CCTV cameras for which prediction is to be modeled are designated as target cameras. Six target cameras are used in all the experiments here, but the framework can be configured to select more CCTV cameras;
(2) To ensure regularity, time series are aggregated by calculating the average vehicle count over a time period. This aggregation period can overcome the challenge of varying camera views per location. The selection of an optimal aggregation period is described in stage 4;
(3) A filtering process flags those cameras with null counts for two consecutive days, excluding them from becoming target cameras, as they do not provide adequate observations for model training. The same filtering process excludes cameras with counts of an identical value for two consecutive days, to remove noise from the time series to be used as a training dataset;
(4) Filtered time series are reshaped into data formats suitable for the SARIMAX, RF and LSTM algorithms, with respect to a specified aggregation period, the input exogenous attributes and a past sequence. The latter constitutes a time series in the past that can be used as input for predicting one step forward in the future. A minimal sketch of steps (2) and (3) is given below.
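This sketch assumes each camera’s raw counts arrive as a pandas Series indexed by timestamp; function and variable names are illustrative:

```python
import pandas as pd

def normalize_counts(raw_counts, freq="30min"):
    """Stage 3 sketch: regularize one camera's vehicle counts and
    flag unusable cameras (returns None when the camera is excluded)."""
    agg = raw_counts.resample(freq).mean()        # step (2): average per period
    daily = agg.resample("1D").mean()
    null_days = daily.isna().astype(int)
    if null_days.rolling(2).sum().max() >= 2:     # step (3): two consecutive null days
        return None
    if (daily.diff() == 0).any():                 # step (3): identical counts on consecutive days
        return None
    return agg
```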
One month of traffic volume was used for training (
Figure 3). The 25% highest flows on weekdays and weekends correspond to the third quartile of the July 2019 dataset. Intuitively, this indicates variations in traffic behavior during peak hours due to people commuting to and from work. Because the chosen CCTV sensors represent traffic volume for roads of different classifications, the number of vehicles varies, as seen on the y-axes in
Figure 3. For instance, the NC_B1307B1 CCTV showed the lowest variation between weekends and weekdays compared to the NC_A695E1 CCTV (
Figure 3). Similarly, different low and high peaks were observed at the six CCTV sensors within a single day. To accommodate such daily and weekly temporal patterns in the prediction models, four exogenous factors were added to the input time series, as follows: (1) Weekend or weekday; (2) day of the week; (3) period of the day before midnight; and (4) period of the day after midnight.
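An illustrative construction of the four calendar attributes is given below; the exact encoding of factors (3) and (4) is an assumption, as the text does not specify their form:

```python
import pandas as pd

def calendar_attributes(index):
    """Build the four exogenous calendar factors for a DatetimeIndex."""
    return pd.DataFrame({
        "is_weekend": (index.dayofweek >= 5).astype(int),   # (1) weekend or weekday
        "day_of_week": index.dayofweek,                     # (2) 0 = Monday ... 6 = Sunday
        "before_midnight": (index.hour >= 12).astype(int),  # (3) assumed encoding
        "after_midnight": (index.hour < 12).astype(int),    # (4) assumed encoding
    }, index=index)
```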
3.5. Evaluation Metrics for Prediction
Three error metrics were adopted to assess the prediction models’ performance, namely, (a) the mean absolute error (MAE); (b) the mean absolute percentage error (MAPE); and (c) the root mean square error (RMSE). These were calculated as follows:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|x_i - y_i\right| \quad (1)$$

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{x_i - y_i}{x_i}\right| \quad (2)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^{2}} \quad (3)$$

where $N$ is the length of the evaluation data, and $x_i$ is the measured and $y_i$ the predicted value of the $i$th observation. It should be noted that no issue with division by zero in the MAPE calculation (2) was encountered, as vehicle counts were always greater than zero during traffic hours (06.30–19.00) for the test dataset.
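For reference, direct NumPy implementations of (1)–(3), with x the measured and y the predicted values, are as follows:

```python
import numpy as np

def mae(x, y):
    return np.mean(np.abs(x - y))

def mape(x, y):
    return np.mean(np.abs((x - y) / x))  # requires x != 0 (see note above)

def rmse(x, y):
    return np.sqrt(np.mean((x - y) ** 2))
```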
3.6. Prediction Model Parameterization (Stage 4)
To identify optimal parameters per model before conducting the actual experiments, a tuning process was carried out using 90% of the one-month (July 2019) datasets as training data and 10% as validation data.
The tuning process for the RF model development involved setting four parameters, as follows: the number of trees; the maximum depth of a tree; the minimum number of samples required to split an internal node of a tree; and the minimum number of samples at a leaf (i.e., terminal) node, as defined in the Python scikit-learn library [
56]. The tuning process was applied to the one-month time series of the NC_A167E1 CCTV (
Figure 3). The process also examined the RF prediction performance using different past sequence lengths (in hours) and aggregation periods (in minutes), as seen in
Table 2. To estimate the best combination of parameters, a grid search algorithm was adopted using a 3-fold cross validation strategy in GridSearchCV [
57] calculating the MAE, as reported in
Table 2. The 3-fold strategy performs cross-validation after fitting a number of RF models equivalent to the number of candidates (i.e., various combinations of parameters) multiplied by three. For instance, 9600 candidates were cross validated three times for tuning with 30 min aggregated training data, which required 10 h on an Intel i5-6500 CPU @ 3.20 GHz running Ubuntu 16.04.5 LTS. MAPE and RMSE metrics were calculated against the 10% validation dataset of the one-month time series.
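A minimal sketch of this tuning step with scikit-learn is given below; the grid values are illustrative rather than the exact candidates tested, and X_train/y_train denote the reshaped lagged counts with exogenous attributes and their one-step-ahead targets:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative candidate grid for the four tuned RF parameters.
param_grid = {
    "n_estimators": [100, 300, 500],     # number of trees
    "max_depth": [10, 20, None],         # maximum depth of a tree
    "min_samples_split": [2, 5, 10],     # min samples to split an internal node
    "min_samples_leaf": [1, 2, 4],       # min samples at a leaf node
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # MAE-based model selection
    n_jobs=-1,
)
search.fit(X_train, y_train)             # assumed reshaped training arrays
print(search.best_params_, -search.best_score_)
```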
As evidenced in
Table 2, all evaluation metrics reached their minimum when data were aggregated over 60 min, regardless of the given past sequence. However, such a period smoothed the traffic pattern and disregarded subtle traffic behavior occurring within an hour. In contrast, owing to its higher temporal resolution, a 15 min aggregation period could capture abrupt changes in traffic behavior, but with MAE and RMSE values close to unity in most cases. As a trade-off between performance and sufficient capture of unexpected short-term traffic behavior, a 30 min aggregation period was therefore chosen. This period was also considered suitable as it could accommodate the challenge of varying views in the CCTV images. For the 30 min aggregation period, all three metrics were lower when 12 and 24 h past sequences were used. Intuitively, when a longer past sequence is fed into a model, the prediction can be more stable. Hence, a past sequence of 24 h and a 30 min aggregation period were selected for all experiments. Based on these settings, the tuning process was repeated for the remaining five CCTV locations. Afterwards, the input time series, together with the exogenous factors, were reshaped into a one-dimensional array and fed into the RF regressor for training using the tuned parameters reported in
Table 3.
Regarding the tuning process for SARIMAX, there are seven parameters to specify. The model is typically denoted as SARIMAX (p, d, q) × (P, D, Q) [S] where: S refers to the number of periods per season; p is the autoregressive term expressing the number of lagged observations; d is the integrated term indicating the degree of differencing of raw observations; q is the moving average term; and uppercase P, D and Q are the equivalent autoregressive, integrated and moving average terms of the model’s seasonal part. More details on the SARIMAX model can be found in previous studies [
35,
39,
40,
42]. An inspection of the time series for the year 2019 ensured that there was no apparent downward or upward trend in the datasets with respect to the four annual seasons. However, as evidenced in
Figure 3, a daily seasonal pattern was observed, implying that S corresponds to 24 h, equivalent to 48 time steps at a 30 min aggregation period. Similar to [
35], since there was no trend in the datasets, adopting S = 48 ensured that the time series became stationary with a lag (i.e., order of difference) equal to one day. This is in line with the selected past sequence of 24 h, as explained previously.
As also suggested in [
42], the autocorrelation and partial autocorrelation function plots were utilized to define a range of candidate values (between 0 and 5) for the six seasonal and non-seasonal parameters. A grid search algorithm was adopted, as implemented in the statistical “pyramid” Python library [
58]. This uses Akaike’s information criterion (AIC) to identify the optimal set of SARIMAX parameters with the best fit to the provided datasets (for AIC calculation, see also [
35]). The grid search was applied to the datasets of the six CCTVs and the best combinations of parameters are listed in
Table 4.
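A hedged sketch of this order search is given below, using pmdarima (the successor of the cited “pyramid” library); y_train and exog_train are assumed to hold the 30 min aggregated counts and the calendar attributes, and older library versions take exogenous= instead of X=:

```python
import pmdarima as pm

model = pm.auto_arima(
    y_train,                       # 30 min aggregated vehicle counts
    X=exog_train,                  # calendar attributes (SARIMAX exogenous input)
    start_p=0, max_p=5, start_q=0, max_q=5,
    start_P=0, max_P=5, start_Q=0, max_Q=5,
    m=48,                          # daily season: 48 half-hour steps
    seasonal=True,
    information_criterion="aic",   # pick the order minimizing the AIC
    stepwise=True,                 # faster search over the candidate orders
    suppress_warnings=True,
)
print(model.order, model.seasonal_order)
```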
Regarding the LSTM model development, various tests for tuning hyperparameters were carried out using the one-month time series of the NC_A167E1 CCTV camera. Initially, based on trial and error, two LSTM layers using the ReLU activation function were found to be suitable, followed by a single dense layer and a single dropout layer with a linear activation function. The Adam optimizer was adopted, with the learning rate set at 0.001 and the mean squared error used as the loss function during training. The one-month time series data were reshaped into a three-dimensional array, as required for training and validation with a batch size of 128. For instance, a validation dataset with a length of 140 observations (i.e., 10% of the one-month dataset) was reshaped into (90, 49, 5), where 90 is the number of sliding-window samples derived from those observations, 49 the length of the past data sequence including the last 30 min time step (i.e., a window of 24 h) and 5 the number of input attributes, comprising the time series of the NC_A167E1 camera alongside the four exogenous factors.
An additional test investigated the optimal number of neurons in the two LSTM layers, applying a grid search over a range of 10–700 neurons, with each configuration run for 100 epochs. MAE, MAPE and RMSE were calculated with respect to the measured traffic volume of the validation dataset. As seen in
Figure 4, there was no significant increase in the metrics’ magnitude when hundreds of neurons were set for the two LSTM layers. However, a slight upward trend was observed for MAE and RMSE values. MAPE was not plotted in
Figure 4 because its magnitude did not deviate by more than ±0.02 from 0.21. Based on this test, 60 neurons were chosen as units in the LSTM layers, as they provided minimal values for all three metrics (MAPE = 0.20, MAE = 0.60 and RMSE = 0.98).
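A minimal Keras sketch of the resulting LSTM structure is shown below, assuming TensorFlow 2.x; the ordering of the dense and dropout layers and the dropout rate are assumptions, as they are not fully specified in the text:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential([
    LSTM(60, activation="relu", return_sequences=True,
         input_shape=(49, 5)),        # 24 h window of 30 min steps, 5 attributes
    LSTM(60, activation="relu"),
    Dropout(0.2),                     # dropout rate not stated; illustrative value
    Dense(1, activation="linear"),    # one-step-ahead vehicle count
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
model.summary()
```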
It should be noted that for the LSTM model development and prediction, a Python script was retrieved from [
27] and amended accordingly. The script utilizes TensorFlow and Keras routines for LSTM modeling. After defining the LSTM structure, six models were trained separately for 200 epochs, one per CCTV location, with all the observations of the one- and four-month datasets per experiment as input. Through trial and error, 200 epochs were found to be sufficient, resulting in a low loss value. Overfitting was examined by monitoring the loss on the validation dataset during the aforementioned tests for the LSTM structure development. An example of a loss curve for the NC_A167E1 CCTV camera is shown in
Figure 5. The algorithm picks the training epoch with the lowest loss value to build the LSTM model.
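In Keras, this lowest-loss epoch selection can be reproduced with a model checkpoint, as sketched below under the same assumptions as above (X_train, y_train, X_val and y_val are the reshaped arrays; the file name is illustrative):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the weights from the epoch with the lowest validation loss.
checkpoint = ModelCheckpoint("lstm_best.h5", monitor="val_loss",
                             save_best_only=True)
history = model.fit(X_train, y_train,
                    epochs=200, batch_size=128,
                    validation_data=(X_val, y_val),
                    callbacks=[checkpoint])
```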
Regarding time consumption for model training, approximately 4 min and 15 min per CCTV sensor were required for a 200-epoch training of an LSTM model with one month and four months of data, respectively. The tests were conducted with a GPU Quadro P4000 in Ubuntu 16.04.5 LTS. Without the use of a GPU, approximately 1.2 and 4.7 min were required to train an RF model with one month and four months of data, respectively.
3.7. Origin–Destination (OD) Matrix
To incorporate spatial dependencies from neighboring CCTV sensors into traffic predictions, an OD matrix was firstly structured across the 219 CCTV locations in the North East region. This was achieved with the aid of the network analyst tool in ArcGIS from ESRI [
59]. An arterial road network model was built using A and B roads supplied by Ordnance Survey [
60]. A part of the road network is mapped with blue lines in
Figure 6.
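The OD matrix construction here relied on ArcGIS; for illustration only, an equivalent open-source computation of shortest-route lengths between camera locations can be sketched with NetworkX, assuming the arterial network is available as a weighted graph G and cam_nodes maps each CCTV identifier to its nearest network node (both assumed inputs):

```python
import networkx as nx

# Shortest-route length from every camera's node to all other nodes.
od_matrix = {
    cam: nx.single_source_dijkstra_path_length(G, node, weight="length")
    for cam, node in cam_nodes.items()
}
# e.g., shortest-route length between two cameras
dist = od_matrix["NC_B1307B1"][cam_nodes["PS191"]]
```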
The OD matrix was constructed based on the calculated shortest routes along the road network between all CCTV locations. It was then imported into the data preparation step to select traffic data from the four closest CCTV sensors that are as evenly distributed as possible across the north, east, south and west directions from the target camera. Prior to this selection step, the traffic data were filtered for noise, as described in stage 3 (data normalization). The filtered time series of traffic data from the selected nearby CCTV sensors were considered in addition to the four exogenous attributes. The selection process is automated and follows the three steps below, with a sketch of the scoring step given after the list:
(1) Select the camera closest to a pre-defined distance from the target camera (0.20 km was used in the experiments here);
(2) Calculate ideal bearings to cameras based on the desired number of cameras to include in the final training set (e.g., if four cameras are to be selected, the angle between them should be 360/4 = 90 degrees). The bearings are offset by this angle (e.g., when the first bearing is at 45 degrees from the target camera, the other three bearings would be at 135, −135 and −45 degrees);
(3) For every remaining bearing, rank all candidate cameras (except the one selected in step 1) according to how close they are to the desired distance and to the desired bearing, and apply weights to choose the best-scoring camera. Distances and bearings are normalized between 0 and 1, and the selection score $s$ is expressed as below:

$$ s = W_d \, d + W_b \, b \quad (4)$$

where $d$ and $b$ are the normalized calculated distances and bearings between candidate CCTVs, with $W_d$ and $W_b$ as their corresponding weights. Here, weights of 4 and 1 are used for distance and bearing, respectively. These values were chosen after a trial and error process, as they proved most suitable for identifying the closest cameras to the target one.
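A hedged sketch of the scoring in step (3) is given below; the helper function and its inputs are hypothetical, and the angular-difference handling and max-based normalization are assumptions not detailed in the text:

```python
import numpy as np

def best_camera(candidates, target_distance, target_bearing, w_d=4.0, w_b=1.0):
    """candidates: list of (camera_id, distance, bearing_deg) tuples."""
    ids = [c[0] for c in candidates]
    d = np.array([abs(c[1] - target_distance) for c in candidates])
    b = np.array([abs((c[2] - target_bearing + 180.0) % 360.0 - 180.0)
                  for c in candidates])        # smallest angular difference
    d_hat = d / d.max() if d.max() > 0 else d  # normalize to [0, 1]
    b_hat = b / b.max() if b.max() > 0 else b
    score = w_d * d_hat + w_b * b_hat          # lower score = better match
    return ids[int(np.argmin(score))]
```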
The inset map in
Figure 6 shows an example for the NC_B1307B1 target camera, with PS191, NC_B1307A1, NC_GNSA1 and PS193 assigned as the four nearby CCTV cameras for the one-month time series. It should be noted that the PS196 camera included null traffic data for the specified training periods; hence, it was excluded from the selection step, even though it was the closest to the target camera. Additionally, no CCTV cameras located in the east direction were close enough to be selected; therefore, the PS193 camera was considered as the fourth choice.
5. Discussion
In terms of assessing the three different prediction approaches, the presented experiments have demonstrated that the SARIMAX model failed to accurately capture the traffic conditions at all six CCTV locations, despite the ease of its implementation. In contrast, LSTM and RF provided predictions closer to the “ground truth” data, but required more demanding preparation to implement. With regard to computational efficiency, RF does not require a GPU and runs faster than LSTM. This is advantageous for deployment in real-time visualization platforms such as the Flood-PREPARED architecture/web-based dashboard. Currently, the RF algorithm can build multiple prediction models for different locations at once, once the model parameters per CCTV sensor have been identified in a previous step. Moreover, future work will investigate including weather conditions, social events and holidays as additional exogenous attributes. This would enable traffic volume predictions to ultimately be associated with more realistic contextual information, providing a better understanding of the impact on the city’s road infrastructure before, during and after a disruptive event.
Regarding the three predictors’ performance, RF delivered the most consistent results across the various experiments. The inclusion of a four-month time series for training significantly improved the LSTM predictions. In addition, RF predictions became more accurate, especially in capturing traffic peaks. The further inclusion of detected vehicle counts from four neighboring CCTV locations showed a general consistency in RF outputs, and the model worked sufficiently well when missing data were observed due to CCTV camera shutdowns. Combining other approaches, such as KNN, with the OD matrix in the RF model could potentially improve the current selection process of nearby CCTVs and include only correlated spatio-temporal information in the traffic prediction. Overall, it was shown that the RF machine learning method constitutes a simple and fast option for a real-time application, with lower computational demand and no GPU requirement, unlike deep learning methods.
Regarding the overall prediction accuracy, a validation process was applied at two different stages of the framework. Firstly, the vehicle detection outcome, modeled with the fine-tuned Faster R-CNN, was evaluated with 50 NECA CCTV images, in which manually identified vehicles were used as ground truth. This is a common procedure for object detection evaluation, although the ideal scenario would be to have ground truth vehicle numbers measured by other means, such as inductive loops; this setup was not feasible at the time the experiments took place. However, after experimentation with different numbers of images and epochs for training, the fine-tuned Faster R-CNN Inception V2 model provided a harmonic mean greater than 90% with the highest number of detected vehicles (
Table 1). This was indicative of a successful vehicle detection performance, sufficient to be used for the next phase of prediction. A second validation step was then applied to evaluate the predicted values of the three approaches against the results produced by the chosen fine-tuned Faster R-CNN Inception V2 model. In that way, even in the case of a poor vehicle detection result, the subsequent prediction accuracy could still be high, provided the prediction algorithm (such as the RF or LSTM models) performed well. In the case of the SARIMAX model, a poor prediction would not be attributed to low vehicle detection accuracy but to the underperformance of the prediction model itself. In other words, poor vehicle detection accuracy does not adversely affect the overall prediction outcome in the developed framework.
Regarding future directions, the end-to-end prediction framework can potentially be implemented at any CCTV location on the region’s IoT infrastructure, providing a useful tool for the traffic management bureau. By providing real-time data to decision makers prior to and during a disruptive event (e.g., a flood), it could enable more effective responses to such events. To that end, integration with a web-based visualization platform such as the Flood-PREPARED dashboard will allow users to easily access up-to-date data on current traffic volume as well as predictions ahead of time, along with other contextual information, giving city traffic managers, emergency planners and others responding to an event the detailed data needed to better inform response decisions. It should be mentioned that for real-time web-based applications, it is important to find the balance between prediction accuracy, computational capacity and cost. Hence, a simple machine learning method could be implemented more easily than a more complex deep learning method, provided that accuracy levels are sufficient and the web-based integration follows a low-cost solution to ensure the longevity of the real-time application.