Article

Highly Accurate Deep Learning Models for Estimating Traffic Characteristics from Video Data

by
Bowen Cai
1,
Yuxiang Feng
1,
Xuesong Wang
2 and
Mohammed Quddus
1,*
1
Department of Civil and Environmental Engineering, Imperial College London, London SW7 2AZ, UK
2
School of Transportation Engineering, Tongji University, Shanghai 201804, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8664; https://doi.org/10.3390/app14198664
Submission received: 29 July 2024 / Revised: 27 August 2024 / Accepted: 20 September 2024 / Published: 26 September 2024
(This article belongs to the Special Issue Applications of Artificial Intelligence in Transportation Engineering)

Abstract

Traditionally, traffic characteristics such as speed, volume, and travel time are obtained from a range of sensors and systems such as inductive loop detectors (ILDs), automatic number plate recognition (ANPR) cameras, and GPS-equipped floating cars. However, many issues associated with these data have been identified in the existing literature. Although roadside surveillance cameras cover most road segments, especially on freeways, existing techniques to extract traffic data (e.g., speed measurements of individual vehicles) from video are not accurate enough to be employed in a proactive traffic management system. Therefore, this paper aims to develop a technique for estimating traffic data from video captured by surveillance cameras. The paper develops a deep learning-based video processing algorithm for detecting, tracking, and predicting highly disaggregated vehicle-based data, such as trajectories and speed, and transforms such data into aggregated traffic characteristics such as speed variance, average speed, and flow. Taking traffic observations from a high-quality LiDAR sensor as ‘ground truth’, the results indicate that the developed technique estimates lane-based traffic volume with an accuracy of 97%. With the application of the deep learning model, the computer vision technique can estimate individual vehicle speeds with an accuracy of 90–95% for different camera angles when the objects are within 50 m of the camera. The developed algorithm was then utilised to obtain dynamic traffic characteristics from a freeway in southern China and employed in a statistical model to predict monthly crashes.

1. Introduction

Unstable traffic dynamics that may lead to road traffic collisions are a complex phenomenon involving the interaction of road geometry, environment, traffic dynamics, vehicles, and humans [1,2]. Due to obstacles and inconveniences in acquiring traffic data, conventional freeway crash prediction models rarely incorporate real-time traffic dynamics (e.g., speed, speed variance, traffic flow, vehicle types) as explanatory variables into their analyses. However, atypical changes in direction and speed, which are reflected in traffic dynamics, usually place other road users in jeopardy of collision [3]. Much evidence shows that traffic conflicts are crash precursors, and traffic dynamics are therefore related to vehicle crashes [4]. Recently, high-resolution detectors have provided abundant data regarding vehicle speed, lane traffic flow, vehicle types, and weather information, which makes real-time crash prediction possible and more accurate.
However, acquiring real-time traffic motion parameters by means of high-resolution detectors such as LiDAR, radar, and inductive loops may not be practical in many cases. First, unlike surveillance cameras, LiDAR, radar, and inductive loops only cover key road segments; most typical road segments are not covered. Second, the installation of LiDAR, radar, and inductive loops is costly and labour-intensive, whereas extracting traffic motion parameters from roadside surveillance cameras is more effective and straightforward. Third, LiDAR, radar, and inductive loops cannot acquire certain parameters, such as lane-change behaviour, vehicle trajectories, fire detection, and lane occupancy. Taking the above-mentioned factors into consideration, it would be more desirable if various traffic motion parameters could be directly extracted from video data.
A plethora of roadside surveillance cameras have been installed to observe traffic. In China, there are 626 million surveillance cameras in total, of which around 24 million are installed on roadways [1]. However, fewer than 2% of surveillance cameras are smart cameras that can detect vehicle speed and traffic violations [1]. Most roadside cameras are not intended to observe traffic operating features at the time of installation. They are typically used by operators who can tilt, pan, and zoom, thereby changing the camera calibration [5]. The combination of movable cameras and a lack of calibration makes it impossible to estimate speed and volume or to detect abnormal operating conditions [5]. Nevertheless, recently developed deep learning and computer vision techniques make it possible to obtain real-time traffic operating parameters, such as traffic incidents, traffic flow, abnormal scenarios, speed, weather, congestion, vehicle fires, pedestrians, vehicle trajectories, and vehicle types, from ordinary roadside cameras [6]. By making use of prompt traffic information, it is possible to construct an intelligent alert system that not only informs the control management centre of current traffic operating conditions but also supports a forecasting system to predict upcoming crashes.
The fairness multi-object tracking (FairMOT) network was recently introduced to combine the detection and re-ID stages and has achieved high efficiency and effectiveness in object tracking tasks. The FairMOT network structure can be adopted to track vehicles and reconstruct their trajectories [7]. To improve FairMOT’s re-ID ability in traffic scenarios, a modified version of FairMOT with CSPDarknet53 (a cross-stage partial DarkNetwork with 53 layers) is proposed in this study. A comparison of the modified FairMOT and other conventional methods is presented in Table 1.
This study applies a modified fairness multi-object tracking (FairMOT) network and derives a spatial transformation matrix to calculate world coordinates from the image plane, in order to estimate lane-based traffic volume, individual vehicle speed, vehicle types, and vehicle trajectories in real time. The video processing algorithm developed in this study is robust and stable under a wide range of conditions, such as day and night, clear and foggy, and sunny and rainy weather. It does not rely on additional facilities such as radar, nor does it require calibration measurements such as those needed for speed cameras. The algorithm is flexible and can be used to analyse both real-time and historical video data. The proposed algorithm makes good use of existing roadside cameras, and the large quantity of roadway operating data it produces provides valuable information for freeway management and crash prediction.
The key contributions of the paper are summarized as follows:
  • This paper introduces a new affine transformation from the image space to the world space, which is deduced from the referenced literature. The projection matrix is generated according to road markings to obtain a representation of the offset of each vehicle’s position relative to the camera for speed evaluation [36].
  • A modified fairness multi-object tracking (mFairMOT) network is applied to strengthen the ability of a typical FairMOT network to detect and track traffic conditions. When utilising mFairMOT and a spatial affine transformation matrix, it is feasible to detect various traffic parameters such as individual vehicle speed, traffic flow, vehicle type, and abnormal incidents like fire, congestion, and parking.
  • The developed algorithm can be used not only for extracting real-time traffic parameters but also for historical video data without requiring any calibration.
In order to showcase the practicality of obtaining dynamic traffic parameters, video data were gathered from a freeway in southern China to develop crash prediction models. More specifically, one year of video data were processed using the proposed deep learning methods, and the number of crashes was obtained from the local traffic police stations. A rolling-basis zero-inflated logarithmic link count time-series (ZILT) model with a negative binomial distributional assumption was built for daily crash prediction [11]. It was found that dynamic traffic features play an important role in crash prediction, as crash characteristics are largely related to traffic operations.

2. Literature Review

2.1. Sensor Dependent Detection Methods

Roadway traffic safety management and its automation have received increasing attention. An integral aspect of traffic safety management is accident detection, as vehicle collisions on highways typically result in the obstruction of a lane, inducing an increase in traffic volume in remaining lanes, leading to the cumulative risk of subsequent accidents [12]. Factors influencing traffic accident risk range from environmental factors to temporal–spatial patterns. Previous studies illustrate the magnitude of influence of certain factors on overall accident rates and have devised data models to predict risk within a specified segment based on available data. Some conventional proactive accident detection approaches are centred around sensor-recorded data or environmental factors such as precipitation, traffic volume, and road geometry [13,14,15]. Others have identified non-physical patterns regarding accident rates, such as the spatio-temporal correlation of traffic accident occurrences and traffic risk’s non-uniform distribution over time, and established predictive models based on traffic accident data [16].
Despite serving as predictive models built from environmental data, such models do not define a real-time cumulative risk (from the relevant aspects) that can be referenced. Specifically, environmental factors such as road design and spatio-temporal patterns do not vary with traffic flow and thus fail to reflect the state of traffic at the time risk inferences are made [16]. In addition, sensor-collected data (from radar, inductive loops, etc.) are integral to some of the proposed models.
Despite extensive research on sensor-dependent detection methods, the implementation of such methods has encountered numerous obstacles. For instance, the expense required for certain hardware installations often renders such approaches unfeasible on a large scale [17]. This deficiency is aggravated by the prevalence of vehicle-based approaches that require data collected from sensors installed on each vehicle to achieve an optimal effect [18].

2.2. Video-Based Computer Vision Detection Methods

Video footage-based detection methods manifest superiority in terms of financial cost and hardware efficiency. Video-based detection algorithms, especially online-compatible ones, are trivial to integrate into pre-existing traffic surveillance cameras on highways, negating all external hardware costs. With the prevalence of existing traffic cameras (estimated at 24 million installations in China, though <1% of these have smart monitoring algorithms implemented) [1], incorporating automatic detection methods relieves dependence on manual monitoring, yielding a significant increase in the efficiency of each installation. In addition, sensor-based approaches are more limited in their range of applications and are often restricted to carrying out inference/analysis regarding one particular aspect. For instance, radar-based accident prevention approaches are limited to distance evaluation only [19]. Video-based detection methods enable more conclusive analysis of the collected region, such as abnormality identification (e.g., traffic accidents, incorrectly parked vehicles, off-lane vehicles), optic flow tracking of vehicles, and traffic density analysis. Multiple sensor installations, each dedicated to one of the above tasks, are not needed due to the versatility of video-based detection [20].
Previous studies have explored the benefits of video-based analysis algorithms and their application potential. Sable et al. introduced a simple online vehicle contour-based density measurement method [21]. Damulo et al. proposed a traffic camera-oriented density detection approach based on the Haar cascading algorithm within a designated region of interest (ROI) [9]. Deficiencies of vision-based analysis methods, such as small object detection and environmental influence, have been addressed and mitigated in various vision-only detection studies [22]. Tian et al. developed a traffic accident detection method based on neural networks and cooperative vehicle infrastructure systems (CVIS) [23]. By incorporating multi-scale feature fusion in the proposed YOLO-CA network, the model yields an improved ability to detect smaller/more distant objects, addressing the deficiency in small object detection common to vision-based object detection methods. Feng et al. extracted traffic flow from low-resolution surveillance cameras using a fine-tuned Detectron2 model [24]. Wang et al. addressed vision-based detection in low-visibility conditions by utilising retinex image enhancement to minimise the impact of severe weather on detection results [25]. The enhanced image input undergoes object detection, and the accident probability is determined via a decision tree. Yao et al. showed the feasibility of vehicle-based autonomous detection methods (as opposed to implementing detection on traffic cameras) and designed a first-person risk evaluation system based on anomaly scores derived from motion inconsistencies of preceding vehicles [26]. Huang et al. utilised the comprehensive information provided by cameras to devise a two-stream convolutional network architecture to perform near-accident detection [27].

2.3. Camera Calibration and Object Tracking Algorithms

Slow inference speed and expensive computation cost are two significant drawbacks of vision-based approaches that have been notably alleviated by incremental enhancements. Dailey et al. presented a novel approach for estimating traffic speed from a sequence of images from an uncalibrated camera, exploiting geometric relationships inherent in the image, using frame differencing to isolate moving edges and track vehicles between frames, and using parameters from the distribution of vehicle lengths [8]. Ozbayoglu et al. assembled a predictive visual accident detection model to minimise the delay from network inference speed and evaluated the model with the nearest neighbour and regression tree methods, two less resource-demanding alternatives [28]. Agrawal et al. addressed the aforementioned difficulties by selectively dropping frames based on the histogram difference of consecutive frames to restrict unnecessary/similar frames and predicting accidents with a support vector machine after feature extraction with ResNet50 and K-means clustering [29]. Zu et al. utilised the mean-shift algorithm as a computationally cheap method of tracking and evaluating vehicle motion to produce accident predictions; a Gaussian mixture model was incorporated to perform vehicle identification [30].
The aforementioned studies indicate that conventional image processing algorithms often outperform neural network-oriented algorithms in terms of speed and computation cost in a generic setting. To address the neural network’s speed deficiency in a real-time prediction scenario while maintaining accuracy, novel iterations of object detection algorithms have been proposed specifically to address online applications.
Redmon et al. proposed a network architecture, You Only Look Once (YOLO), for object detection and identification, which offers significant computational time advantages over earlier detection models, such as sliding-window or region-based algorithms, due to its one-pass detection capability [31,32,33]. For this reason, YOLO is commonly integrated as the object detector into vehicle-tracking models with performance demands. YOLO integrated with deep simple online and real-time tracking (Deep SORT) is the most common pairing for object tracking [33]. The integrated procedure is composed of two stages: object detection via YOLO and object tracking via Deep SORT. Although YOLO performs quickly in video object recognition, in object tracking tasks the Deep SORT stage takes time to re-ID objects by calculating their similarities, which slows the overall processing speed.
Multi-object tracking is an important task in computer vision [7]. Compared to Deep SORT, in which the re-ID task is heavily influenced by the detection stage, FairMOT jointly optimises both the detection and re-ID stages and is built on the anchor-free object detection architecture known as CenterNet [7].
Camera calibration for calculating actual speed from video processing has been studied by many scholars. Lai used road lanes, height, and the tilt angle of the camera to calibrate world coordinates in ground distance [34]. Bas and Crisman calibrated cameras by using the vanishing point of the road edge to compute the focal length and pan angle of the camera [35]. However, general camera calibration methods assume camera height and tilt angle beforehand or require special calibration patterns in traffic scenarios that are hard to find [10]. Fung et al. developed a camera calibration method by bridging the relationship between the image coordinates to the world coordinates in terms of pan angle, tilt angle, swing angle, focal length, and camera distance [36]. With these parameters, their study proposed a transformation matrix to convert image coordinates to world coordinates [36]. Our paper refers to Fung et al. for calibrating the camera in a typical scene and calculating vehicle speed from its image locations returned by modified FairMOT [36].

3. Methodology

This study aims to develop a new approach for analysing video from surveillance cameras and extracting dynamic traffic parameters. The model primarily utilised is modified FairMOT (mFairMOT), which employs an encoder–decoder neural network as its backbone architecture. mFairMOT produces detection frames and re-identification objects as its outputs. The detected frames are processed to determine real-time traffic volume and identify abnormal scenarios using logical algorithms. For re-identification, a spatial transformation matrix is computed to convert video coordinates into real-world coordinates. The speed of objects is calculated based on the distance they cover within a given unit of time. Figure 1 displays the flowchart of detection logic that is proposed in this study.

3.1. Video Detection and Tracking Algorithm

The primary purpose of this study is to develop a real-time efficient object tracking algorithm from video data. Conventional tracking methods employ batch-based algorithms [7]. However, these methods may not always fulfil real-time requirements in situations where there are numerous objects present. This could be attributed to (i) features that are not shared and (ii) the re-ID process being repeated. Recently, an object tracking algorithm known as fairness multi-object tracking (FairMOT) has been developed to perform cross-frame object association to accelerate processing speed. However, the algorithm needs to be modified, as the following fundamental issues for detecting and tracking moving vehicles should be addressed:
- Simultaneously detecting and re-identifying (re-ID) tasks
- Effectively fusing aggregated features
- Person re-ID vs. vehicle re-ID
- Faster learning ability
- Generating high-quality re-ID features
Therefore, a modified FairMOT is developed in this paper. The advantage of using the FairMOT network is that it can jointly integrate and balance the detection and re-ID tasks. For detecting moving objects, the anchor-free CenterNet method is integrated with FairMOT [7]. The backbone network of FairMOT, known as ResNet34 with deep layer aggregation (DLA), is normally employed to fuse aggregated features [7]. However, this backbone is particularly effective for person re-identification (person re-ID). Therefore, the backbone network in FairMOT was replaced with ‘Darknet53’ to enhance vehicle detection and re-identification (see Figure 1) [7]. Although Darknet53 has a residual block similar to that of ResNet34, it employs densely connected layers to extract object features. We have also utilised a cross-stage partial network (CSPNet) to maximise the differences between gradient joints and prevent different layers from learning repeated gradient information. This addresses the problems of (i) massive computation and (ii) gradient diffusion. Consequently, the learning ability of the network can be greatly improved by integrating Darknet53 with CSPNet. In addition, DLA is also used to fuse aggregated features for generating high-quality re-ID features.
Therefore, we modified the existing FairMOT network to efficiently track any moving objects, vehicles, and people. The FairMOT network is considered to be the state-of-the-art tracking model in multi-object detection, and its modified version developed in this paper, termed the modified FairMOT (mFairMOT) network, further increases model generalisability in tracking different types of objects. The structure of the mFairMOT is schematically developed in Figure 2.
As shown in Figure 2, the input image is first fed to an encoder–decoder network to extract high-resolution feature maps. The encoder–decoder network is composed of densely connected layers combining Focus layers, CBL blocks, CSP modules, and ReLU activation functions. Two homogeneous branches are then added for detecting objects and extracting re-ID features, respectively. The detection branch obtains object detection frames, while the re-ID branch is used for tracking and assigns a unique ID to each tracked object to depict its trajectory. The features at the predicted object centres are used for tracking.
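To make the two-branch head concrete, the following is a minimal PyTorch sketch of the structure described above: a shared high-resolution feature map feeds a detection branch (centre heatmap, box size, and offset, in the style of CenterNet-based detectors) and a parallel re-ID embedding branch. This is an illustration of the idea rather than the authors’ implementation, and all layer sizes (64 input channels, 128-dimensional embeddings) are assumptions.

```python
# Illustrative sketch of an mFairMOT-style head: detection and re-ID branches
# computed from one shared feature map. Layer widths are assumptions.
import torch
import torch.nn as nn

class TrackingHead(nn.Module):
    def __init__(self, in_channels=64, num_classes=1, reid_dim=128):
        super().__init__()
        def branch(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1),
            )
        self.heatmap = branch(num_classes)   # object-centre heatmap
        self.box_size = branch(2)            # width/height regression
        self.offset = branch(2)              # sub-pixel centre offset
        self.reid = branch(reid_dim)         # per-location re-ID embedding

    def forward(self, feature_map):
        return {
            "heatmap": torch.sigmoid(self.heatmap(feature_map)),
            "box_size": self.box_size(feature_map),
            "offset": self.offset(feature_map),
            "reid": self.reid(feature_map),
        }

# Example: a quarter-resolution feature map from the encoder-decoder backbone.
features = torch.randn(1, 64, 152, 272)
outputs = TrackingHead()(features)
```

At inference, peaks in the heatmap give object centres, the corresponding size/offset values give the bounding boxes, and the embedding taken at each centre is used to associate the same vehicle across frames.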

3.2. Vehicle Speed Calculation

To account for potential scenarios where the height and orientation of traffic cameras vary across diverse installations, it is necessary to conduct installation-specific calibration to attain precise vehicle position detection before deploying the system. An affine transformation from the image space to the world space is therefore deduced. A projection matrix is created based on road markings to acquire a representation of the position offset of each vehicle relative to the camera, which is then used for speed evaluation [36].
To ensure accurate calibration for each camera installation, a marker indicating the projected surface of a road segment must be provided. This marker should be a quadrilateral shape that encapsulates the outline of the road segment and represents a rectangle on the road surface. A sample marker preset is illustrated in Figure 3.
The marker deduction approach assumes that the road is approximately straight with parallel lanes and a constant slope. Flat roads with no change in elevation throughout the camera’s view frustum are preferred, because the slope of a rising road surface is seldom uniform, which introduces projection error and decreases the model’s accuracy [36].
Assume a camera model in the world space with position parameters $x$, $y$, and $z$; yaw angle $y$ (i.e., the horizontal angle of the forward axis of the camera with respect to the world Y axis), pitch $p$, roll $r$, and tilt angle $t$; and projection parameters focal length $f$ and camera length $l$. Let the vehicle world coordinate be $C_{world} \in \mathbb{R}^4$ and the image-space coordinate be $C_{screen} \in \mathbb{R}^3$ (position vectors in homogeneous form). The transformation function $\mathbb{R}^2 \rightarrow \mathbb{R}^3$ maps a given screen coordinate vector to the world coordinate, given the road marker data, via a perspective transformation [37]. The perspective projection process is defined as [36,37]:
$$C_{screen} = P \times R \times T \times C_{world}$$
$P$ represents the projection matrix, while $T$ and $R$ represent the matrices responsible for the position and rotation of the viewport in world space, respectively. This affine transformation can be inverted to obtain the world coordinate from the screen coordinate of the vehicle:
$$C_{world} = T^{-1} \times R^{-1} \times P^{-1} \times C_{screen}$$
Excluding the conversion of camera properties from road markers, the transformation matrices are defined as:
$$T = \begin{bmatrix} 1 & 0 & 0 & x \\ 0 & 1 & 0 & y \\ 0 & 0 & 1 & z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
$$R = \begin{bmatrix} \cos y & -\sin y & 0 & 0 \\ \sin y & \cos y & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos p & 0 & \sin p & 0 \\ 0 & 1 & 0 & 0 \\ -\sin p & 0 & \cos p & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos r & -\sin r & 0 \\ 0 & \sin r & \cos r & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
$$P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1/f & 0 \end{bmatrix}$$
The camera position can be inferred from the rotation and the camera length, namely [36]:
$$x = l \sin y \cos p, \qquad y = l \cos y \cos p, \qquad z = l \sin t$$
The road marker’s geometry is labelled as shown in Figure 3.
Trivially, the label’s on-screen width $\omega$, along with the other auxiliary quantities, can be determined by [36]:
$$\alpha(A,B) = X_B - X_A, \qquad \beta(A,B) = Y_B - Y_A, \qquad \gamma(A,B) = X_A Y_B - X_B Y_A, \qquad \omega = \beta(A,C)$$
The camera parameters can then be determined as [36]:
$$f = \frac{\gamma(B,D)\cos y \cos p}{\beta(B,D)\sin y \cos r - \beta(B,D)\cos y \sin p \sin r + \alpha(B,D)\sin y \sin r + \alpha(B,D)\cos y \sin p \cos r}$$
$$\tan y = \frac{\sin p\,\big[\sin r\,\big(\beta(B,D)\gamma(A,C) - \beta(A,C)\gamma(B,D)\big) + \cos r\,\big(\alpha(A,C)\gamma(B,D) - \alpha(B,D)\gamma(A,C)\big)\big]}{\sin r\,\big(\alpha(B,D)\gamma(A,C) - \alpha(A,C)\gamma(B,D)\big) + \cos r\,\big(\beta(B,D)\gamma(A,C) - \beta(A,C)\gamma(B,D)\big)}$$
$$l = \frac{\omega\,\big(f\sin p + X_A\cos p \sin r + Y_A\cos p \cos r\big)\big(f\sin p + X_C\cos p \sin r + Y_C\cos p \cos r\big)}{\big(f\sin p + X_A\cos p \sin r + Y_A\cos p \cos r\big)\big(X_C\cos y \sin r - X_C\sin y \sin p \cos r + Y_C\cos y \cos r + Y_C\sin y \sin p \sin r\big) + \big(f\sin p + X_C\cos p \sin r + Y_C\cos p \cos r\big)\big(X_A\cos y \sin r - X_A\sin y \sin p \cos r + Y_A\cos y \cos r + Y_A\sin y \sin p \sin r\big)}$$
The roll angle $\tan r$ is obtained from an analogous, considerably longer closed-form ratio of sums of products of $\alpha$, $\beta$, and $\gamma$ evaluated at the marker corner pairs; the full expression is given in Fung et al. [36].
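As an illustration of how the calibrated parameters are used, the following numpy sketch builds $T$, $R$, and $P$ from the recovered angles, focal length, and camera length, and maps an image point back to the road plane. Because tracked vehicles are assumed to lie on the road surface ($z_{world} = 0$), the 3 × 4 projection reduces to an invertible 3 × 3 homography. This is our own reading of the equations above, not the authors’ code, and all numeric values are illustrative.

```python
# Sketch of the image-to-world mapping implied by the projection model above.
import numpy as np

def camera_position(l, yaw, pitch, tilt):
    # Camera position inferred from the rotation angles and camera length, as in the text.
    return (l * np.sin(yaw) * np.cos(pitch),
            l * np.cos(yaw) * np.cos(pitch),
            l * np.sin(tilt))

def projection_matrices(x, y, z, yaw, pitch, roll, f):
    T = np.eye(4)
    T[:3, 3] = [x, y, z]                               # translation matrix T
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0, 0], [sy, cy, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
    Ry = np.array([[cp, 0, sp, 0], [0, 1, 0, 0], [-sp, 0, cp, 0], [0, 0, 0, 1]])
    Rx = np.array([[1, 0, 0, 0], [0, cr, -sr, 0], [0, sr, cr, 0], [0, 0, 0, 1]])
    R = Rz @ Ry @ Rx                                   # combined rotation matrix R
    P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0 / f, 0]])  # projection matrix P
    return P, R, T

def image_to_road(u, v, P, R, T):
    """Map a pixel (u, v) to world coordinates on the road plane z_world = 0."""
    M = P @ R @ T                                      # full 3x4 mapping, C_screen = P R T C_world
    H = M[:, [0, 1, 3]]                                # drop the z column because z_world = 0
    X, Y, w = np.linalg.inv(H) @ np.array([u, v, 1.0]) # invert the resulting homography
    return X / w, Y / w

# Illustrative parameters (angles in radians, focal length in pixels).
x, y, z = camera_position(l=40.0, yaw=0.35, pitch=-0.25, tilt=0.30)
P, R, T = projection_matrices(x, y, z, yaw=0.35, pitch=-0.25, roll=0.02, f=1200.0)
print(image_to_road(640.0, 360.0, P, R, T))
```

In practice, the angles, focal length, and camera length would come from the road-marker calibration described above rather than being set by hand.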
The speed calculation of vehicles is hindered by positional inconsistencies induced by inaccuracies due to object detection and tracking. More specifically, vehicle orientation and missing frames in vehicle detection are the primary problems.
One influential factor is the oscillation of the bounding boxes predicted by the mFairMOT network [7]. Since the FairMOT prediction is independent of the preceding and succeeding frames, no penalty is imposed on the geometric consistency of the predicted bounding boxes, which introduces noise into the vehicle’s positioning data.
In addition, mFairMOT experiences difficulties in detection involving complex spatial interactions, occlusions, and obscure object borders. This problem manifests mainly on road segments that are not parallel to the camera frustum’s forward vector, where vehicles occasionally become obstructed by other objects or nearby vehicles, resulting in frames where the vehicle is not registered by the object detector. While this is sporadic, losing such frames results in significant inaccuracies in footage featuring the aforementioned road type.
To alleviate this deficiency, a custom interpolation filter $F_i$ is incorporated to adapt to bounding box oscillations and frame skips for vehicle $i$. Each vehicle is assigned to a data tracker during its lifetime (its exposure as an object in the footage). In evaluating the speed $S$ at the current frame, $\sigma$ denotes the generalised kernel function for interpolating between two (often adjacent) footage frames:
$$S = \sigma\big(F_i, C_{world}, t_i\big)$$
$t_i$ denotes the time since the previous frame in which the vehicle was detectable, making no distinction between the delay from the frame rate and the delay of skipped frames. Speed per second is computed as $|C_i - F_{i,prev}| / t_i$ (approximating over a missing frame with linear interpolation) and transformed with a proportional–integral–derivative (PID) controller to reduce the positional noise generated by the object detector. A PID controller is usually used in automatic control, but when applied to computer vision, it can accelerate optimisation convergence by reducing the overshoot problem of stochastic gradient descent (SGD) momentum during weight updating [39]. The PID controller has three components, which need to be specified beforehand as $O_P$, $O_I$, and $O_D$. The specifications for the controller in this study are as follows:
$$O_P = 0.51, \qquad O_I = 0, \qquad O_D = 0.103$$
The PID controller in this context is incorporated as a filter for speed updating rather than as a ‘control’ algorithm. The controller output specifies the internally adjusted speed of a vehicle in the current frame, while the set point (SP) and process variable (PV) are the estimated vehicle speed (from the projection transformation) and the internally predicted speed from the previous frame, respectively. Since the SP shifts throughout the process, the integral component ($O_I$) is mostly irrelevant.
Vehicle speed is calculated from the distance that the vehicle has travelled in a unit of time. In this study, the unit of time is defined as one second, i.e., 30 frames. The modified FairMOT model is used to track each vehicle and assign a unique ID to the detected object. The distance that a vehicle has travelled in one second is then calculated by converting pixel displacements to actual distances in world coordinates through the transformations derived above.
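The following is a minimal sketch of this speed-estimation step, assuming vehicle positions have already been mapped to world coordinates: displacement over elapsed time gives a raw speed, which is then smoothed with a PID-style filter using the gains quoted above ($O_P = 0.51$, $O_I = 0$, $O_D = 0.103$). The discrete update form and the helper class are our own illustration, not the authors’ implementation.

```python
# Illustrative per-vehicle speed estimator with PID-style smoothing.
import numpy as np

class SpeedTracker:
    """Speed from world-space displacement, filtered with PID gains from the paper.
    The gains are applied once per detection step (discrete form, an assumption)."""
    def __init__(self, kp=0.51, ki=0.0, kd=0.103, fps=30.0):
        self.kp, self.ki, self.kd, self.fps = kp, ki, kd, fps
        self.prev_pos = None      # last world-space position where the vehicle was seen
        self.prev_frame = None    # frame index of that observation
        self.speed = 0.0          # filtered speed (process variable), m/s
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, world_pos, frame_idx):
        pos = np.asarray(world_pos, dtype=float)
        if self.prev_pos is None:
            self.prev_pos, self.prev_frame = pos, frame_idx
            return self.speed
        dt = (frame_idx - self.prev_frame) / self.fps        # bridges skipped frames
        measured = np.linalg.norm(pos - self.prev_pos) / dt  # raw speed estimate (set point)
        error = measured - self.speed
        self.integral += error
        self.speed += (self.kp * error + self.ki * self.integral
                       + self.kd * (error - self.prev_error))
        self.prev_pos, self.prev_frame, self.prev_error = pos, frame_idx, error
        return self.speed

# World positions (metres) of one tracked vehicle over four consecutive frames.
tracker = SpeedTracker()
for frame, pos in enumerate([(0.0, 0.0), (0.9, 0.0), (1.9, 0.1), (2.8, 0.1)]):
    speed_mps = tracker.update(pos, frame_idx=frame)
print(round(speed_mps * 3.6, 1))  # filtered speed in km/h after the last update
```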

3.3. Traffic Volume Calculation

To compute traffic volume, a virtual box is created within each lane (see Figure 3). As a vehicle passes, it intersects with the virtual box, creating a shadow area. When the shadow area grows from 0% to over 70% and then reduces back to 0%, this indicates that a vehicle has passed. The traffic volume is then incremented by one. The choice of 70% as the threshold is empirically based. Figure 4 shows the traffic volume detection logic through a virtual box and shadow area movements when a vehicle passes.
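A short sketch of this counting rule is given below, assuming axis-aligned bounding boxes and that the “shadow area” percentage is measured relative to the virtual box; the class and its method names are hypothetical, while the 70% threshold is taken from the text.

```python
# Illustrative lane-volume counter: count a vehicle when its overlap with the
# virtual lane box rises above 70% and then drops back to 0%.

def overlap_ratio(vehicle_box, virtual_box):
    """Fraction of the virtual box covered by a vehicle box; boxes are (x1, y1, x2, y2)."""
    x1 = max(vehicle_box[0], virtual_box[0]); y1 = max(vehicle_box[1], virtual_box[1])
    x2 = min(vehicle_box[2], virtual_box[2]); y2 = min(vehicle_box[3], virtual_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = (virtual_box[2] - virtual_box[0]) * (virtual_box[3] - virtual_box[1])
    return inter / area

class LaneCounter:
    def __init__(self, virtual_box, threshold=0.7):
        self.virtual_box, self.threshold = virtual_box, threshold
        self.armed = False      # True once the shadow area has exceeded the threshold
        self.volume = 0

    def update(self, vehicle_boxes):
        ratio = max((overlap_ratio(b, self.virtual_box) for b in vehicle_boxes), default=0.0)
        if ratio >= self.threshold:
            self.armed = True
        elif ratio == 0.0 and self.armed:
            self.volume += 1    # shadow grew past 70% and shrank back to 0%: one vehicle passed
            self.armed = False
        return self.volume

# Three frames: vehicle approaching, vehicle covering the box, vehicle gone.
counter = LaneCounter(virtual_box=(100, 400, 220, 460))
for boxes in [[(90, 380, 200, 450)], [(120, 390, 230, 470)], []]:
    volume = counter.update(boxes)
print(volume)  # 1
```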

3.4. Zero-Inflated Logarithmic Link for Count Time-Series Model

As discussed, traditional aggregated traffic flow data may not be suitable for modelling highly disaggregated traffic crashes. Therefore, the video analytics method developed in this paper can be employed to acquire the relevant data for such crash prediction models. Since we are interested in modelling traffic crashes that occur on a road segment over a time period (e.g., an hour, a day, a week, and so on), the statistical model should be able to handle excess zeros as well as spatial and serial autocorrelation. Typical zero-inflated models can deal with excess zeros in crash prediction; however, they cannot handle the serial correlation inherent in time-series crash data. A time-series zero-inflated model with a negative binomial assumption and a logarithmic link has therefore been developed to control both spatial and temporal structures in traffic crash analysis [11,38]. The same method is applied in this study to demonstrate the value of the highly disaggregated traffic flow data obtained by mFairMOT. The model is briefly explained here.
For a log-linear model with covariates where the parameters can be negative, the parameter space is taken to be:
$$\Theta = \left\{ \vartheta \in \mathbb{R}^{a+b+c+1} : \beta_0 > 0,\ |\beta_1|, \ldots, |\beta_a|,\ |\alpha_1|, \ldots, |\alpha_b|,\ \sum_{k=1}^{a} \beta_k + \sum_{l=1}^{b} \alpha_l < 1 \right\}$$
Under negative binomial distribution, the quasi-maximum estimation for the dispersion parameter φ given regression parameter Θ is expected to iterate until convergence [11,38].
For a given vector of observations y = y 1 , , y n T , the conditional quasi-log–likelihood function, up to a constant, can be expressed as:
$$\ell(\vartheta) = \sum_{t=1}^{n} \log p_t(y_t; \vartheta) = \sum_{t=1}^{n} \Big( y_t \ln\big(\lambda_t(\vartheta)\big) - \lambda_t(\vartheta) \Big)$$
where $p_t(y_t; \vartheta)$ refers to the probability density function and $\lambda_t(\vartheta)$ refers to the conditional mean.
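As a concrete illustration, the following numpy sketch evaluates this conditional quasi-log-likelihood for a log-linear conditional mean driven by dynamic traffic covariates; the autoregressive feedback terms of the full model are omitted for brevity, and all covariate values and coefficients are invented for the example.

```python
# Illustrative evaluation of the quasi-log-likelihood above (covariates-only mean).
import numpy as np

def quasi_log_likelihood(y, X, beta0, beta):
    """y: observed crash counts per period; X: matrix of dynamic traffic covariates."""
    log_lambda = beta0 + X @ beta          # logarithmic link
    lam = np.exp(log_lambda)               # conditional mean lambda_t
    return np.sum(y * np.log(lam) - lam)   # sum over t of y_t ln(lambda_t) - lambda_t

# Hypothetical week of data: daily crash counts and two video-derived covariates.
y = np.array([0, 0, 1, 0, 2, 0, 1])
X = np.column_stack([
    np.array([62, 70, 85, 64, 95, 60, 80]),            # e.g. speed variance
    np.array([1.1, 1.2, 1.6, 1.0, 1.9, 1.1, 1.5]),      # e.g. flow (thousand veh/day)
])
print(quasi_log_likelihood(y, X, beta0=-3.0, beta=np.array([0.01, 0.5])))
```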
The video detection technique extracts dynamic traffic parameters and integrates these motion parameters with the zero-inflated logarithmic link count time-series model to make predictions of traffic crashes.

4. Results Analysis

4.1. Speed Detection Accuracy

Due to the dearth of publicly available labelled traffic camera footage to serve as a ground truth source for speed evaluation, a custom dataset of traffic footage labelled by LiDAR was established for model evaluation in this study. The validation dataset comprises a total of two hours of video, with 15 min for each freeway section (each covering 200 m). The traffic scenes are shown in Figure 5.
For an output of length $M$, a simple mean-squared-error loss function $J$ was used for model evaluation, as follows:
$$J(\theta_{output}) = \frac{1}{M} \sum_{i=1}^{M} \big(\theta_{output,i} - \theta_{groundtruth,i}\big)^2$$
False negatives during the object detection phase are treated as an output speed of 0. For ease of calculation and generalisability, the vehicle speed $\theta$ denotes the mean speed over a frame sequence $S$ rather than over individual frames.
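A small sketch of this evaluation, assuming per-vehicle mean speeds keyed by (hypothetical) vehicle IDs and scoring undetected vehicles as 0, is as follows.

```python
# Illustrative MSE evaluation with false negatives scored as speed 0.
import numpy as np

def speed_mse(estimated, ground_truth):
    """estimated: dict vehicle_id -> mean estimated speed (missing = false negative);
    ground_truth: dict vehicle_id -> LiDAR-measured mean speed."""
    errors = [(estimated.get(vid, 0.0) - true_speed) ** 2
              for vid, true_speed in ground_truth.items()]
    return float(np.mean(errors))

print(speed_mse({"car_1": 27.1, "car_2": 24.8},
                {"car_1": 26.5, "car_2": 25.0, "car_3": 22.0}))
```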
The dedicated validation dataset comprises footage from varying traffic camera heights and angles to simulate a real-world scenario. Dataset footage is categorised into groups by the respective camera view (e.g., parallel, diagonal, or perpendicular in terms of the camera’s view of the road’s direction). Camera installations with a view direction perpendicular to the road, while rare on existing highways, were selected to evaluate the influence of perspective view distance on the model’s accuracy. Losses resulting from the detection in different distance ranges (on the ground plane space) are separated to emphasise the effect on accuracy from the detection distance.
The detection accuracy is 1 minus the percentage error, i.e., the proportional difference between the actual and estimated speeds. The error is less significant in footage with a perpendicular view, as the forward direction of a vehicle is not affected by the projection transformation, thus retaining precision.
Another deficiency is the constant fluctuation of speed for vehicles near the vanishing point in a parallel view. Theoretically, the magnitude of fluctuation is exponentially larger for vehicles in greater proximity to the vanishing point, as each pixel translates to a larger span in world space. The fluctuation can be reasonably suppressed with the proportional–integral–derivative (PID) controller used as the kernel function $\sigma$, as mentioned previously, which helps to tune down sporadic spikes of false acceleration/deceleration [39]. The PID function was first introduced in industrial process control for noise reduction and signal filtering.
Although incorporating a filtering kernel yields a significant reduction in fluctuating noise, the suppression weakens after about 50 m on footage obtained from standardized traffic cameras with an elevation of 12 m. Note that the distance as shown in Table 2 and Figure 5 is measured as the distance on the horizontal plane (the elevation of the camera is not taken into account).
As noted in Table 2, the estimated speed is quite precise within 50 m in the daytime for the frontal view. At nighttime, the estimation is badly affected by missed detections, as the available sight distance is short.

4.2. Traffic Volume and Abnormal Traffic Scenarios Detection

Modified fairness multi-object tracking (mFairMOT) models, together with dedicated detection logic, were used to extract traffic operation parameters and emergency events from roadside video recordings, including vehicle speed, traffic flow by lane, vehicle types, individual vehicle trajectories, accidents, abnormal parking, pedestrians on the road, unexpected objects, traffic congestion, vehicle fires, etc. Two types of mFairMOT tracking models were trained: an mFairMOT vehicle tracking model and an mFairMOT person tracking model.
(1) Abnormal parking
The modified fairness multi-object tracking (mFairMOT) neural network model was trained to assign each vehicle a unique ID. Even if the vehicle is occluded for a while, the ID remains the same when it reappears, provided that the object features are correctly extracted and the occlusion lasts only a short period. The system issues an alarm when the same ID remains on the screen for more than one minute, with the prerequisite that neither ‘traffic congestion’ nor ‘pedestrian on the road’ is active. Figure 6 displays the detection of abnormal parking scenarios.
(2) People on the road
The mFairMOT pedestrian tracking neural network model was trained to assign each person a unique ID when a person is detected. Even if the same person disappears for a while, the ID remains the same when it shows up again. The system issues an alarm when it detects a pedestrian walking on the freeway for five consecutive frames as shown in Figure 7.
(3) Traffic congestion
The mFairMOT vehicle tracking model counts the number of vehicles on the screen. If the intersection over union (IOU) (area of overlap/area of union) is larger than 0.5 (a threshold that can be tuned for specific scenarios) for 10 consecutive frames, the system’s alarm is triggered and the traffic congestion status is reported to the respective control centre(s). The threshold is left as a tuning parameter to reduce false-positive calls in different traffic scenarios. Figure 8 displays the detection of congestion.
(4) Vehicle fire
The mFairMOT network is programmed to detect smoke and fire simultaneously for 3 consecutive frames. If only one of fire or smoke is detected, no fire alarm is triggered. The fire and smoke classes have already been pre-trained and stored in the mFairMOT backend database. Figure 9 displays the detection of vehicle fires.
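The four rules above can be summarised by the following compact sketch. The function names, frame-counting interface, and demo values are hypothetical, while the thresholds (one minute, five frames, an IOU of 0.5 over ten frames, and three frames of both fire and smoke) are taken from the text.

```python
# Simplified alarm rules driven by per-frame mFairMOT tracking outputs.
FPS = 30  # the study uses 30 frames per second

def abnormal_parking(track_age_frames, congestion_alarm, pedestrian_alarm):
    # Same vehicle ID on screen for over one minute, and no congestion/pedestrian alarm active.
    return track_age_frames > 60 * FPS and not congestion_alarm and not pedestrian_alarm

def pedestrian_on_road(consecutive_person_frames):
    # A tracked person detected on the carriageway for five consecutive frames.
    return consecutive_person_frames >= 5

def traffic_congestion(recent_ious, iou_threshold=0.5):
    # Overlapping vehicles (IOU above the tunable threshold) sustained over 10 consecutive frames.
    return len(recent_ious) >= 10 and all(iou >= iou_threshold for iou in recent_ious[-10:])

def vehicle_fire(fire_frames, smoke_frames):
    # Fire and smoke must both be detected for three consecutive frames before alarming.
    return fire_frames >= 3 and smoke_frames >= 3

print(abnormal_parking(track_age_frames=2000, congestion_alarm=False, pedestrian_alarm=False))
print(traffic_congestion([0.6] * 12))
```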
In summary, detection accuracy on different perspective settings is displayed in Table 3.

4.3. Real Application for Video Detection Algorithm in Monthly Crash Prediction

Video processing was applied to a freeway in southern China for extracting traffic features and building daily and monthly crash prediction models to forecast the number of crashes on every road segment for each direction.
The freeway is a two-way, four-lane road with an overall speed limit of 110 km/h, a tunnel speed limit of 80 km/h, and a speed limit of 80 km/h on two further intervals.
Dynamic traffic features such as vehicle speed, traffic volume by direction, and the truck-to-car ratio were extracted through video processing and, together with weather conditions, historical crash counts, and the effects of holidays and special events (in particular, COVID-19 regulation policies), were set as independent variables to predict crash occurrence. A zero-inflated logarithmic link count time-series (ZILT) model with a negative binomial distributional assumption for crashes was built to handle the spatio-temporal structure and make daily predictions. For comparison, the same model was also built to make crash predictions without the dynamic traffic features.
Crash prediction accuracy is compared with and without involving dynamic traffic data as displayed in Table 4.
For the model without dynamic traffic features, the predicted results show greater variance and the R-squared value is significantly lower than for the model with dynamic traffic features. This means that more of the variance in crash occurrence can be explained by including the dynamic traffic features (obtained through the video analytics presented earlier) as independent variables. Video analytics therefore plays an important role in traffic crash prediction by extracting additional dynamic motion parameters and thus providing more information.

5. Discussion

Automatic crash detection and traffic flow analysis contribute significantly to proactive traffic safety management. Current standards and models for traffic crash prediction focus on sensor-dependent data models and external hardware installations (e.g., radar) [23,26]. Such systems suffer from the deficiency of imposing extra costs during implementation and maintenance. Moreover, there is a dearth of pre-existing roadway cameras with smart traffic analysis systems, the majority of which rely on manual monitoring of traffic footage. As an alternative to the aforementioned systems with high implementation costs, and to make full use of pre-existing roadside cameras, this study developed a real-time system for extracting multi-vehicle motion parameters from a typical roadside surveillance camera.
The resulting speed prediction demonstrated reasonable accuracy when compared to the ground truth obtained from the LiDAR. The PID filter was successful in suppressing the noise in the predictions of vehicles close to the camera; however, the suppressing effect was marginal for vehicles over ≈ 50 m from the camera (with an average decrease of 27% in object tracking accuracy). Much of this accuracy fluctuation was incurred by discrete prediction through different frames [40].
As the prediction of an object’s bounding box does not consider the predictions in previous frames, the bounding box of the same object may have a drastically different shape or aspect ratio across a sequence of frames. While this disparity in shape affects the resulting position prediction of the vehicle only marginally, the resulting noise in the vehicle’s position severely affects the speed prediction at longer ranges, where the noise-suppressing PID is no longer effective.
Estimation accuracy varies considerably between day and night because of the lighting conditions. Compared to radar or LiDAR, speed estimation from a computer vision technique is cost-effective and can return more abundant information, such as vehicle types, traffic flow, trajectories, and abnormal incidents, at lower cost. The modified FairMOT model changes the ResNet34 backbone network of the original FairMOT model to CSPDarknet53. Compared to ResNet34, CSPDarknet53 has more layers to capture deeper object features and is a light but effective learning neural network consisting of residual blocks. The modified FairMOT model therefore has better detection and re-ID ability in traffic scenarios. Table 5 compares the performance of the modified FairMOT with other state-of-the-art tracking models in the tested scenario.
The modified FairMOT model outperforms the other tracking models in the tested scenario. The mFairMOT model balances the detection and tracking processes fairly and improves the re-identification branch by integrating CSPDarknet53 while retaining the DLA structure in FairMOT to fuse aggregated features. YOLO-v8 has trouble detecting small objects and offers various model sizes that sacrifice re-ID performance. DeepSORT has advantages in maintaining detection in occluded environments and can differentiate between objects in complex environments; however, it is a two-stage model that trains a separate detection network and requires extensive datasets for high accuracy. SORT is the preliminary version of DeepSORT and lacks feature extraction ability in complex scenarios. The Kalman filter is a statistical model with pre-assumptions that must be satisfied, which restricts its application in random motion scenarios; it is also the algorithm integrated into both the FairMOT and DeepSORT models.
Limitations of the model mainly revolve around detection inaccuracy for distant vehicles and distorted camera angles. Due to the precision limitation of low-resolution footage, the model’s inaccuracy increases significantly with distance, reducing the model’s reliability for cameras positioned further from traffic. Moreover, the suppression effect of the PID kernel function lessens with distance; it falls off significantly beyond approximately 50 m, after which the model is no longer able to predict vehicle speed with satisfactory accuracy. Distorted camera angles also influence speed estimation accuracy, since they increase the difficulty of converting image coordinates into spatial coordinates through the mathematical derivation, and errors may be generated. Reducing the detection distance and choosing a frontal camera angle are effective ways to markedly increase model performance by minimising environmental noise.
In terms of the crash prediction model, setting dynamic traffic features as input variables achieves significantly higher accuracy than only considering historical crash numbers, weather, and holiday conditions. This is because crash characteristics are related to dynamic traffic features. In future studies, causal inference between crash occurrence and the influencing dynamic traffic parameters can be analysed with the assistance of video processing techniques for pre-crash scenario analysis.
Future improvements will focus on estimating traffic crash probability by utilising the vehicle speeds determined in this study, with the aim of devising a crash prediction framework that contributes to proactive traffic accident prevention. Specific research directions include:
(1) The lane detection method can be integrated into the model. More detailed dynamic traffic parameters, such as lane-changing times and lane speed variances, could be incorporated into the crash prediction model to enhance the accuracy of real-time crash prediction models.
(2) The causal effect of dynamic traffic parameters on crash occurrence could be analysed. Deep learning computer vision techniques could be applied to extract traffic parameters relating to pre-crash scenarios. The relationship between crash occurrence and the influencing variables can then be established, and the influence of individual variables on crashes can be investigated.
(3) With the involvement of traffic operation parameters and in consideration of the importance of the temporal–spatial structure in hourly traffic crash prediction, our ongoing research proposes a new joint model combining the time-series generalized regression neural network (TGRNN) model and the binomially weighted convolutional neural network (BWCNN) model. The joint model aims to capture all these characteristics in short-term crash prediction.

6. Conclusions

This paper developed a deep learning video processing method to estimate highly disaggregated traffic characteristics from video data. The unique features and contributions of the paper include:
(i) Contrary to other existing techniques, such as drawing virtual boxes or setting reference lines that must be measured in advance, this method does not need any calibration requiring advance measurement, meaning that it can be considered a ‘generic’ approach.
(ii) Important parameters can be estimated that could not be estimated with existing camera-based video processing methods, including lane-change behaviour, speed variance, fire detection, and vehicle trajectories.
(iii) This technique offers higher accuracy, especially in estimating individual vehicle speed. Therefore, it can not only be applied to real-time videos but can also be used to process existing videos regardless of camera pose and recording angle.
(iv) This paper optimises the fairness multi-object tracking (FairMOT) network by using CSPDarknet53 to strengthen the typical FairMOT ability to detect and track traffic scenarios. Bounding boxes are obtained by subjecting the encoder–decoder network predictions within FairMOT to non-maximum suppression, integrating CSPDarknet53 while retaining the DLA structure in FairMOT to fuse aggregated features. The modified FairMOT associates the same vehicle across different frames, allowing the screen-space distance offset per vehicle to be obtained [33].
(v) A proportional–integral–derivative (PID) controller was incorporated as a kernel function σ to suppress sporadic noise in speed caused by the precision limitation of low resolution.
(vi) Finally, the world-space speed (i.e., the travelling speed of vehicles) was obtained by transforming the image-space offset with a projection matrix derived from pre-labelled road markings.
The detection and estimation accuracy of the proposed model in the tested scenario was validated against ground truth data identified by observing recorded videos from roadside surveillance cameras over the past three years with professional judgement from experts, together with actual vehicle speeds collected from LiDAR, radar, and speed data returned by the Gaode Map Company. The results showed that the proposed method can successfully detect abnormal scenarios, such as accidents, abnormal parking, pedestrians, unexpected objects, traffic congestion, and vehicle fires. The results also demonstrated that speed detection accuracy varies across distance ranges and view angles. In the most prevalent parallel view, the speed loss is minimised within 50 m, which amounts to a 5% speed estimate difference per vehicle in freeway analysis. In addition, the PID implementation of σ effectively suppresses speed noise by 36.62%.
A real-time abnormal traffic condition alarm system can be built by applying the deep learning computer vision technique together with judgement logic to detect abnormal traffic incidents and raise alarms for abnormal conditions, drawing on vehicle speed, lane traffic flow, vehicle types, vehicle trajectories, accidents, abnormal parking, pedestrians, unexpected objects, traffic congestion, vehicle fires, etc. However, a real-time alarm system demands sufficient GPU computing power and an internet connection to process real-time data. Such a system can be established in the control centre of the freeway management department to monitor traffic conditions, and corresponding prevention measures can be devised to cope with emergency incidents ahead of time.
The developed algorithm makes good use of roadside cameras. The large volume of roadway traffic operational data it produces could provide valuable information for freeway management. By making use of prompt traffic information, researchers can build an intelligent alert system that not only informs the control management centre of the current traffic operating status (automatic surveillance) but also supports a forecasting system to predict crashes in the near future.

Author Contributions

Conceptualisation, B.C.; methodology, B.C.; formal analysis, B.C.; investigation, B.C.; writing—original draft preparation, B.C.; writing—review and editing, Y.F., X.W. and M.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to security and privacy reason.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. IHS Markit, Video Surveillance: How Technology and The Cloud is Disrupting The Market. 2016. Available online: https://cdn.ihs.com/www/pdf/IHS-Markit-Technology-Video-surveillance.pdf (accessed on 14 June 2024).
  2. Sabey, B.; Staughton, G.C. Interacting Roles of Road Environment Vehicle and Road User in Accidents. CESTE 1 Most. 1975. Available online: https://trid.trb.org/View/46132 (accessed on 14 June 2024).
  3. Hossain, M.; Muromachi, Y. A Bayesian network based framework for real-time crash prediction on the basic freeway segments of urban expressways. Accid. Anal. Prev. 2011, 45, 373–381. [Google Scholar] [CrossRef]
  4. Zheng, L.; Sayed, T.; Mannering, F. Modeling traffic conflicts for use in road safety analysis: A review of analytic methods and future directions. Anal. Methods Accid. Res. 2021, 29, 100142. [Google Scholar] [CrossRef]
  5. Formosa, N.; Quddus, M.; Ison, S.; Abdel-Aty, M.; Yuan, J. Predicting real-time traffic conflicts using deep learning. Accid. Anal. Prev. 2020, 136, 105429. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, X.; Feng, Y.; Angeloudis, P.; Demiris, Y. Monocular visual traffic surveillance: A review. IEEE Trans. Intell. Transp. Syst. 2022, 23, 14148–14165. [Google Scholar] [CrossRef]
  7. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the fairness of detection and re-identification in multiple object tracking. arXiv 2020, arXiv:2004.01888. [Google Scholar] [CrossRef]
  8. Dailey, D.; Cathey, F.; Pumrin, S. The Use of Uncalibrated Roadside CCTV Cameras to Estimate Mean Traffic Speed. December 2021. Available online: https://rosap.ntl.bts.gov/view/dot/14762 (accessed on 16 June 2024).
  9. Damulo, J.; Dy, R.; Pestaño, S.; Signe, D.; Vasquez, E.; Saavedra, L.; Cañete, E. Video-based traffic density calculator with traffic light control simulation. AIP Conf. Proc. 2020, 2278, 20046. [Google Scholar]
  10. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  11. Cai, B.; Quddus, M.; Miao, Y. A new modelling approach for predicting disaggregated time-series traffic crashes. In Proceedings of the 102nd Transportation Research Board Annual Meeting, Washington, DC, USA, 9–13 January 2022. [Google Scholar]
  12. Retallack, A.; Ostendorf, B. Relationship between traffic volume and accident frequency at intersections. Int. J. Environ. Res. Public Health 2020, 17, 1393. [Google Scholar] [CrossRef]
  13. Duivenvoorden, K. The Relationship between Traffic Volume and Road Safety on the Secondary Road Network; SWOV: Den Haag, The Netherlands, 2010. [Google Scholar]
  14. Eisenberg, D. The mixed effects of precipitation on traffic crashes. Accid. Anal. Prev. 2004, 36, 637–647. [Google Scholar] [CrossRef]
  15. Shefer, D.; Rietveld, P. Congestion and safety on highways: Towards an analytical model. Urban Stud. 1997, 34, 679–692. [Google Scholar] [CrossRef]
  16. Milton, J.; Mannering, F. The relationship among highway geometrics, traffic-related elements and motor-vehicle accident frequencies. Transportation 1998, 25, 395–413. [Google Scholar] [CrossRef]
  17. Ren, H.; Song, Y.; Wang, J.; Hu, Y.; Lei, J. A deep learning approach to the citywide traffic accident risk prediction. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3346–3351. [Google Scholar]
  18. Heyns, E.; Uniyal, S.; Dugundji, E.; Tillema, F.; Huijboom, C. Predicting traffic phases from car sensor data using machine learning. Procedia Comput. Sci. 2019, 151, 92–99. [Google Scholar] [CrossRef]
  19. Nithya, M.; Nagarajan, P.; Deepalakshmi, R.; Rani, M.; Swarna, S. Sensor based accident prevention system. J. Comput. Theor. Nanosci. 2020, 17, 1720–1724. [Google Scholar] [CrossRef]
  20. Ajao, L.A.; Abisoye, B.O.; Jibril, I.Z.; Jonah, I.Z.; Kolo, J.G. In-vehicle traffic accident detection and alerting system using distance-time based parameters and radar range algorithm. In Proceedings of the 2020 IEEE PES/IAS PowerAfrica, Nairobi, Kenya, 25–28 August 2020; pp. 1–5. [Google Scholar]
  21. Sable, T.; Parate, N.; Nadkar, D.; Shinde, S. Density and time based traffic control system using video processing. ITM Web Conf. 2020, 32, 3028. [Google Scholar] [CrossRef]
  22. Nguyen, N.; Do, T.; Ngo, T.; Le, D. An evaluation of deep learning methods for small object detection. J. Electr. Comput. Eng. 2020, 2020, 3189691. [Google Scholar] [CrossRef]
  23. Tian, D.; Zhang, C.; Duan, X.; Wang, X. An automatic car accident detection method based on cooperative vehicle infrastructure systems. IEEE Access 2019, 7, 127453–127463. [Google Scholar] [CrossRef]
  24. Feng, Y.; Zhao, Y.; Zhang, X.; Batista, S.; Demiris, Y.; Angeloudis, P. Predicting spatio-temporal traffic flow: A comprehensive end-to-end approach from surveillance cameras. Transp. B Transp. Dyn. 2024, 12, 2380915. [Google Scholar] [CrossRef]
  25. Wang, C.; Dai, Y.; Zhou, W.; Geng, Y. A vision-based video crash detection framework for mixed traffic flow environment considering low-visibility condition. J. Adv. Transp. 2020, 2020, 9194028. [Google Scholar] [CrossRef]
  26. Yao, Y.; Xu, M.; Wang, Y.; Crandall, D.; Atkins, E. Unsupervised traffic accident detection in first-person videos. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Venetian Macao, Macau, 4–8 November 2019; pp. 273–280. [Google Scholar]
  27. Huang, X.; He, P.; Rangarajan, A.; Ranka, S. Intelligent intersection: Two-stream convolutional networks for real-time near-accident detection in traffic video. ACM Trans. Spat. Algorithms Syst. 2020, 6, 10. [Google Scholar] [CrossRef]
  28. Ozbayoglu, M.; Kucukayan, G.; Dogdu, E. A real-time autonomous highway accident detection model based on big data processing and computational intelligence. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; Volume 12, pp. 1807–1813. [Google Scholar]
29. Agrawal, A.K.; Agarwal, K.; Choudhary, J.; Bhattacharya, A.; Tangudu, S.; Makhija, N.; Bakthula, B. Automatic traffic accident detection system using ResNet and SVM. In Proceedings of the 2020 Fifth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), Bangalore, India, 26–27 November 2020; pp. 71–76. [Google Scholar]
30. Zu, H.; Xu, Y.; Ma, L.; Fang, J. Vision-based real-time traffic accident detection. In Proceedings of the 11th World Congress on Intelligent Control and Automation, Shenyang, China, 29 June–4 July 2014; pp. 1035–1038. [Google Scholar]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
32. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  33. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  34. Lai, H.S. Vehicle Extraction and Modeling, an Effective Methodology for Visual Traffic Surveillance. Ph.D. Thesis, The University of Hong Kong, Hong Kong, 2000. [Google Scholar]
  35. Bas, E.K.; Crisman, J.D. An easy to install camera calibration for traffic monitoring. In Proceedings of the Conference on Intelligent Transportation Systems, Boston, MA, USA, 9–12 November 1997; pp. 362–366. [Google Scholar]
  36. Fung, G.; Yung, N.; Pang, G. Camera calibration from road lane markings. Opt. Eng. 2003, 42, 2967–2977. [Google Scholar] [CrossRef]
  37. Haralick, R.M. Using perspective transformations in scene analysis. Comput. Graph. Image Process. 1980, 13, 191–221. [Google Scholar] [CrossRef]
38. Liboschik, T.; Fokianos, K.; Fried, R. tscount: An R package for analysis of count time series following generalized linear models. J. Stat. Softw. 2017, 82, 1–51. [Google Scholar] [CrossRef]
  39. An, W.; Wang, H.; Sun, Q.; Xu, J.; Dai, Q.; Zhang, L. A PID controller approach for stochastic optimization of deep networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8522–8531. [Google Scholar]
40. Kim, J.; Sung, J.; Park, S. Comparison of Faster R-CNN, YOLO, and SSD for real-time vehicle type recognition. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Asia (ICCE-Asia), Seoul, Republic of Korea, 1–3 November 2020; pp. 1–4. [Google Scholar]
Figure 1. Flowchart of detection logic.
Figure 2. Modified FairMOT (mFairMOT) network structure (CBL: convolution + batch normalisation + Leaky ReLU block; Res Unit: residual block unit; Leaky ReLU: activation function; BN: batch normalisation; CSP: cross-stage partial network; SPP: spatial pyramid pooling).
Figure 3. Road marker label (A, B, C and D are anchor points; W represents the width of the rectangle).
Figure 4. Virtual box and shadow area movements when a vehicle passes.
Figure 5. Representative scenario used for speed detection (straight lines denote detection region).
Figure 6. Abnormal parking detection (straight lines denote detection region).
Figure 7. Person on freeway detection.
Figure 8. Traffic congestion detection.
Figure 9. Detection of vehicle fire (blue: fire, red: smoke).
Table 1. Comparison of modified FairMOT with other conventional methods.

Model Comparison | Benefits | Drawback
Visual average speed computer and recorder of OpenCV (VASCAR) [8] | Light computing load. | Relies heavily on object detection: if the crossing of the entry reference line or the exit point is detected earlier or later, the calculated speed changes accordingly.
YOLO + DeepSORT [9] | Avoids the time-difference problem by not relying on a reference line. | Detection and tracking are separate processes, so overall speed-detection accuracy is affected by small disturbances.
FairMOT [7] | Combines the detection and re-identification (re-ID) processes. | Re-ID accuracy needs to be improved for robustness.
Modified FairMOT (mFairMOT) | Can be adopted to track vehicles and generate their trajectories. | No apparent drawback.
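To make the VASCAR row in Table 1 concrete, the sketch below shows how a reference-line speed estimate is typically derived from the frames at which a vehicle crosses two lines a known ground distance apart. It is an illustrative Python sketch under assumed parameters (function name, frame rate, and line spacing are not from the paper), not the authors' implementation; the worked example shows why a one-frame error in either detected crossing shifts the estimate, which is the drawback noted in the table.

```python
# Illustrative (hypothetical) VASCAR-style speed estimate from the frames at
# which a vehicle crosses two reference lines a known ground distance apart.

def vascar_speed_kmh(entry_frame: int, exit_frame: int,
                     fps: float, line_gap_m: float) -> float:
    """Speed in km/h from the travel time between two reference lines."""
    elapsed_s = (exit_frame - entry_frame) / fps
    if elapsed_s <= 0:
        raise ValueError("exit crossing must occur after entry crossing")
    return (line_gap_m / elapsed_s) * 3.6  # m/s -> km/h

# Example: reference lines 20 m apart, crossings detected 24 frames apart at 25 fps.
print(vascar_speed_kmh(entry_frame=100, exit_frame=124, fps=25.0, line_gap_m=20.0))  # 75.0 km/h

# A single-frame error in either detected crossing already shifts the estimate
# by a few km/h, which is the timing sensitivity noted in Table 1.
print(vascar_speed_kmh(entry_frame=100, exit_frame=123, fps=25.0, line_gap_m=20.0))  # ~78.3 km/h
```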
Table 2. Speed detection accuracy on different perspective settings.

Distance (m) | Side View | Right Frontal View
≤20 | 96% (daytime) / 92% (nighttime) | 98% (daytime) / 94% (nighttime)
≤50 | 91% (daytime) / 84% (nighttime) | 95% (daytime) / 88% (nighttime)
>50 | 75% (daytime) / 70% (nighttime) | 85% (daytime) / 80% (nighttime)
Table 3. Detection accuracy on different perspective settings.

Item | Accuracy | False-Positive Reason | Optimisation Solutions
Traffic flow | 96% (daytime), 95% (nighttime) | Influenced by articulated vehicles. | Label more data for special vehicles.
Parking | 92% (daytime), 90% (nighttime) | Roadside signs are recognised as cars, especially at night. | Label more data for the specific scenarios.
Pedestrian walking | 93% (daytime), 89% (nighttime) | Roadside trees are occasionally recognised as persons. | Label more data for the specific scenarios.
Vehicle fires | 96% (daytime), 97% (nighttime) | When the intensity of the fire is low, the algorithm sometimes fails to raise an alarm. | Label more data for both smoke and flame.
Traffic congestion | 92% (daytime), 90% (nighttime) | Affected by the tracking model; the same ID may be assigned twice. | Improve tracking-model stability by changing its structure.
Table 4. Prediction accuracy on different perspective settings (based on ground-truth data).

Model | Prediction Accuracy | Mean Square Error
Prediction without dynamic traffic data | 78.5% | 0.96
Prediction with dynamic traffic data | 83.1% | 0.88
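The comparison in Table 4 can be reproduced in spirit with a standard count-regression setup: fit a crash-frequency model with only static covariates, refit it with the video-derived dynamic traffic variables added, and compare predictive error. The minimal Python sketch below uses a Poisson GLM from statsmodels as a stand-in; the data file and column names are hypothetical, and the paper's actual specification (a count time-series model in the spirit of tscount [38]) may differ.

```python
# Minimal sketch: crash-frequency models with and without the video-derived
# dynamic traffic variables. The data file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("monthly_segment_crashes.csv")  # hypothetical monthly panel

static_cols = ["segment_length_km", "lane_count", "curvature"]        # assumed static covariates
dynamic_cols = ["avg_speed", "speed_variance", "lane_traffic_flow"]   # video-derived covariates

def fit_and_score(feature_cols):
    """Fit a Poisson GLM for monthly crash counts and return it with its in-sample MSE."""
    X = sm.add_constant(df[feature_cols])
    y = df["monthly_crashes"]
    result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    mse = ((y - result.predict(X)) ** 2).mean()
    return result, mse

_, mse_static = fit_and_score(static_cols)
_, mse_dynamic = fit_and_score(static_cols + dynamic_cols)
print(f"MSE without dynamic traffic data: {mse_static:.2f}")
print(f"MSE with dynamic traffic data:    {mse_dynamic:.2f}")
```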
Table 5. Model performance comparison for different tracking models in the tested scenario.

Model | IDF1 (fraction of correctly identified detections over the average number of ground-truth and computed detections) | MOTP (Multiple Object Tracking Precision; measures the accuracy of detection-box localisation) | MOTA (Multi-Object Tracking Accuracy; measures the overall accuracy of both the tracker and the detector)
mFairMOT | 96.0% | 98.5% | 95.7%
YOLOv8 | 94.5% | 97.2% | 94.3%
SORT | 90.8% | 94.3% | 90.5%
DeepSORT | 93.9% | 95.9% | 93.8%
Kalman Filter | 88.2% | 91.2% | 87.9%
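The metrics in Table 5 follow the standard identity-based and CLEAR-MOT definitions. As a generic reference only (not the evaluation code used in this study), the sketch below computes IDF1, MOTP, and MOTA from aggregate per-sequence counts; all numbers in the example are made up.

```python
# Standard multi-object tracking metrics computed from aggregate counts.
# All example numbers below are made up for illustration.

def mota(fn: int, fp: int, id_switches: int, num_gt: int) -> float:
    """Multi-Object Tracking Accuracy: overall accuracy of tracker and detector."""
    return 1.0 - (fn + fp + id_switches) / num_gt

def motp(total_overlap: float, num_matches: int) -> float:
    """Multiple Object Tracking Precision: average localisation quality (e.g., IoU) of matched boxes."""
    return total_overlap / num_matches

def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1: F1 score over identity-preserving true/false positives and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

print(f"MOTA = {mota(fn=120, fp=80, id_switches=10, num_gt=5000):.3f}")  # 0.958
print(f"MOTP = {motp(total_overlap=4311.0, num_matches=4790):.3f}")      # 0.900
print(f"IDF1 = {idf1(idtp=4700, idfp=150, idfn=180):.3f}")               # ~0.966
```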