2.1. Architecture description
The proposed traffic collection sensor comprises two FireWire cameras: a rear-looking camera and a forward-looking one. The sensor thus covers the local environment of the host vehicle, providing a nearly 360-degree field of view with the exception of the side blind areas (see Figure 1). A common hardware trigger synchronizes the image acquisition of both cameras, and an onboard PC houses the computer vision software.
Each individual vehicle detection system provides information about the number of detected vehicles and their relative positions and speeds. These results are combined with the GPS measurements and the data provided by the CAN bus (vehicle speed) in order to provide globally referenced traffic information. Note that for vehicles without a CAN bus interface, the vehicle speed can be computed from GPS measurements. This scheme is described in Figure 2.
The layers of the proposed architecture are conceptually the same for both vision modules: lane detection, vehicle-candidate selection, vehicle recognition and tracking. The first step of each vision system consists of intelligently reducing the search space in the image plane in order to increase the performance of the vehicle detection module. Accordingly, road lane markings are detected and used as the guidelines that drive the vehicle search process. The area contained within the lane limits is scanned in order to find vehicle candidates, which are passed on to the vehicle recognition modules; in this way, the rate of false positives is reduced. In case no lane markings are detected, a basic region of interest covering the front, rear and side areas is used instead. Finally, a tracking stage is implemented using Kalman filtering techniques.
2.2. Lane detection
An attention mechanism is necessary in order to filter out inappropriate candidate windows based on the lack of distinctive features, such as horizontal edges and vertical symmetrical structures, which are essential characteristics of road vehicles. This has the positive effect of decreasing both the total computation time and the rate of false positive detections. Lane markings are detected using gradient information in combination with a local thresholding method that is adapted to the width of the projected lane markings. Clothoid curves are then fitted to the detected markings. The algorithm scans up to 25 lines in the candidate search area, from 2 meters in front of the camera position to the maximum range, in order to collect the lane marking measurements. The proposed method implements a non-uniform spacing search that reduces certain instabilities in the fitted curve. The final state vector for each lane on the road is composed of the following variables [9]:

$$\mathbf{x} = \left[ c_{0h}, c_{1h}, c_{0v}, c_{1v}, x_0, \theta_0, w_0 \right]^T \qquad (1)$$

where c0h and c1h represent the horizontal clothoid curvature parameters, c0v and c1v stand for the vertical clothoid curvature parameters, and x0, θ0 and w0 are the lateral error and the orientation error with regard to the centre of the lane and the width of the lane, respectively. The clothoid curves are then estimated from the lane marking measurements using a Kalman filter [10] for each lane.
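For reference, the horizontal terms of this clothoid model describe the lateral position of the lane as a third-order polynomial of the look-ahead distance. The following Python sketch evaluates that polynomial over a non-uniformly spaced set of scan distances; the 2–60 m range and all numeric values are illustrative assumptions, not parameters of the actual system:

import numpy as np

def lane_lateral_position(d, x0, theta0, c0h, c1h):
    """Lateral position of the lane (metres) at look-ahead distance d, using the
    third-order clothoid approximation y(d) = x0 + theta0*d + c0h*d**2/2 + c1h*d**3/6."""
    return x0 + theta0 * d + 0.5 * c0h * d**2 + (c1h / 6.0) * d**3

# 25 non-uniformly spaced scan distances, denser close to the host vehicle (illustrative).
distances = 2.0 + (np.linspace(0.0, 1.0, 25) ** 2) * 58.0
offsets = lane_lateral_position(distances, x0=-1.8, theta0=0.01, c0h=1e-3, c1h=1e-5)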
Apart from the detected road lanes, additional virtual lanes are considered so as to cope with situations in which a vehicle is located between two lanes (for example, while performing a lane change manoeuvre). Virtual lanes provide the necessary overlap between lanes, avoiding both misdetections and double detections caused by the two halves of a vehicle being separately detected as two potential vehicles. A virtual lane is located so as to provide overlap between two adjoining lanes.
Figure 3 provides some examples of lane marking detection in real outdoor scenarios. Detected lanes determine the vehicle search area and help reduce false positive detections. In case no lane markings are detected by the system, fixed lanes are assumed instead.
2.3. Forward and rear vehicle detection
Forward and rear looking vehicle detection systems share the same algorithmic core. The attention mechanism sequentially scans each road lane from the bottom of the image to the maximum range, looking for a set of features that might represent a potential vehicle. Firstly, the vehicle contact point is searched for by means of the top-hat transformation. This operator allows the detection of contrasted objects on non-uniform backgrounds [11]. There are two different types of top-hat transformations: white hat and black hat. The white hat transformation is defined as the residue between the original image and its opening. The black hat transformation is defined as the residue between the closing and the original image. The white and black hat transformations are analytically defined as follows:

$$WH(f) = f - (f \circ b) \qquad (2)$$

$$BH(f) = (f \bullet b) - f \qquad (3)$$

where ○ denotes the opening operator and • denotes the closing operator, both with structuring element b. In our case we use the white hat operator [Equation (2)] since it enhances the boundary between the vehicles and the road [12]. Horizontal contact points are pre-selected if the number of white top-hat features is greater than a configurable threshold. Then, candidates are pre-selected if the entropy of Canny points is high enough within a region defined by means of perspective constraints and prior knowledge of target objects (see Figure 4).
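For illustration, this contact-point pre-selection can be sketched with OpenCV as follows; the structuring element size, the binarization threshold and the feature-count threshold are illustrative values, not those of the actual implementation:

import cv2
import numpy as np

def contact_point_rows(gray, lane_mask, kernel_size=(9, 3), min_features=20):
    # White top-hat: original image minus its morphological opening, which
    # enhances bright-over-dark transitions such as the vehicle/road boundary.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
    white_hat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)

    # Binarize the top-hat response and keep only the pixels inside the scanned lane.
    _, features = cv2.threshold(white_hat, 30, 255, cv2.THRESH_BINARY)
    features = cv2.bitwise_and(features, features, mask=lane_mask)

    # Pre-select image rows whose number of top-hat features exceeds the threshold.
    counts = (features > 0).sum(axis=1)
    return np.nonzero(counts > min_features)[0]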
Before computing the Canny features, an adaptive thresholding method is applied. This process is based on an iterative algorithm that gradually increases the contrast of the image and compares the number of Canny points obtained in the contrast-increased image with the number of edges obtained in the current image. If the number of Canny features in the current image is higher than in the contrast-increased image, the algorithm stops; otherwise, the contrast is increased again and the process is resumed. This adaptive thresholding method makes it possible to obtain robust image edges, as depicted in the examples provided in Figure 5.
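The iterative procedure can be sketched as follows; the contrast gain, the Canny thresholds and the iteration limit are illustrative assumptions:

import cv2
import numpy as np

def adaptive_canny(gray, alpha_step=1.1, low=50, high=150, max_iters=10):
    """Iteratively increase contrast until doing so no longer yields more Canny edges."""
    current = gray.copy()
    current_edges = cv2.Canny(current, low, high)
    for _ in range(max_iters):
        # Increase contrast (simple linear gain, clipped to the 8-bit range).
        boosted = cv2.convertScaleAbs(current, alpha=alpha_step, beta=0)
        boosted_edges = cv2.Canny(boosted, low, high)
        # Stop when the contrast-increased image does not produce more edge points.
        if np.count_nonzero(boosted_edges) <= np.count_nonzero(current_edges):
            break
        current, current_edges = boosted, boosted_edges
    return current_edges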
In a second step, vertical edge (Sv), horizontal edge (Sh) and grey level (Sg) symmetries are computed, and candidates only pass to the next stage if their symmetry values are greater than a threshold. The vertical and horizontal edge symmetries are computed as listed in Figure 6. The grey level symmetry computation procedure is shown in Figure 7. Some examples of the three types of symmetries are depicted in Figure 8.
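Since the exact procedures are listed in Figures 6 and 7, the following Python sketch only illustrates the general idea of an axis-wise symmetry score; the normalization and scoring details are assumptions:

import numpy as np

def left_right_symmetry(patch):
    """Generic symmetry score in [0, 1] of a non-negative candidate patch with respect
    to its vertical axis: 1.0 means a perfectly mirror-symmetric patch. The same
    measure can be applied to edge maps (Sv, Sh) or to grey levels (Sg)."""
    patch = patch.astype(np.float32)
    mirrored = patch[:, ::-1]
    diff = np.abs(patch - mirrored).sum()
    norm = patch.sum() + mirrored.sum()
    return 1.0 - diff / norm if norm > 0 else 0.0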
Symmetry axes are linearly combined to obtain the final position of the candidate. Finally, a weighted score is defined as a function of the entropy of Canny points, the three symmetry values and the distance to the host vehicle. We use this score to apply a per-lane non-maximum suppression process that removes overlapping candidates. An example of this process is shown in Figure 9.
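A possible form of this per-lane suppression is sketched below; the candidate representation and the overlap criterion are illustrative assumptions:

def non_maximum_suppression(candidates, overlap_fn, max_overlap=0.5):
    """Keep, per lane, only the highest-scoring candidate among overlapping ones.
    Each candidate is assumed to be a dict with a 'score' and a 'box' entry."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(overlap_fn(cand["box"], k["box"]) < max_overlap for k in kept):
            kept.append(cand)
    return kept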
The selected candidates are classified by means of a linear Support Vector Machine (SVM) classifier [13] in combination with Histograms of Oriented Gradients (HOG) features [14]. We have developed and tested two different classifiers, one per module (forward and rear classifiers). All candidates are resized to a fixed size of 64 × 64 pixels to facilitate the feature extraction process. The rear-SVM classifier is trained with 2,000 samples and tested with 1,000 samples (1/1 positive/negative ratio), whereas the forward-SVM classifier is trained with 3,000 samples and tested with 2,000 samples (1/1 positive/negative ratio). Figures 10 and 11 depict some positive and negative samples of the forward and rear training and test data sets, respectively. Figure 12 shows a couple of examples of vehicle detection after linear SVM classification with HOG features.
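For illustration, this classification stage can be reproduced with a linear SVM over HOG descriptors, for instance using scikit-image and scikit-learn; the HOG cell and block parameters are illustrative, not necessarily those used in this work:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(patch_64x64):
    # HOG features on the 64 x 64 candidate patch (illustrative parameters).
    return hog(patch_64x64, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def train_classifier(positive_patches, negative_patches):
    X = np.array([hog_descriptor(p) for p in positive_patches + negative_patches])
    y = np.array([1] * len(positive_patches) + [0] * len(negative_patches))
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf

# At run time, a candidate is accepted as a vehicle if clf.predict([hog_descriptor(c)])[0] == 1.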
After an object has been consecutively classified as a vehicle a predefined number of times (empirically set to 3 in this work), the data association and tracking stages are triggered. The data association problem is addressed by using feature matching techniques: Harris features are detected and matched between two consecutive frames, as depicted in Figure 13.
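For illustration, such an association step can be sketched with OpenCV, here using Harris-based corner detection and pyramidal Lucas-Kanade tracking as one possible matching strategy; the detector parameters are illustrative:

import cv2
import numpy as np

def match_harris_features(prev_gray, curr_gray, roi_mask, max_corners=50):
    # Detect Harris corners inside the tracked vehicle region of the previous frame.
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=5,
                                      mask=roi_mask, useHarrisDetector=True, k=0.04)
    if corners is None:
        return None, None
    # Match the corners in the current frame with pyramidal Lucas-Kanade optical flow.
    matched, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, corners, None)
    good = status.ravel() == 1
    return corners[good], matched[good]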
Tracking is implemented using Kalman filtering techniques [10]. For this purpose, a dynamic state model and a measurement model must be defined. The proposed dynamic state model is simple. Let us consider the state vector xn, defined as follows:

$$\mathbf{x}_n = \left[ x, y, w, h \right]^T \qquad (4)$$

In the state vector, x and y are the respective horizontal and vertical image coordinates of the top left corner of every object, and w and h are the respective width and height in the image plane. The dynamical model equation can then be written as:

$$\mathbf{x}_{n+1} = A(\Delta t)\, \mathbf{x}_n + \boldsymbol{\omega}_n \qquad (5)$$

In the model, Δt is the sample time, A represents the system dynamics matrix and ωn is the noise associated to the model. Although the definition of A is simple, it proves to be highly effective in practice, since the real-time operation of the system assures that there will not be great differences in distance for the same vehicle between consecutive frames. The model noise has been modelled as a function of distance and camera resolution. The state model equation is used for prediction in the first step of the Kalman filter. The next step is to define the measurement model. The measurement vector is defined as zn = [u, v, w, h]T. Then, the measurement model equation is established as follows:

$$\mathbf{z}_n = H\, \mathbf{x}_n + \mathbf{v}_n \qquad (6)$$

In the last equation, H represents the measurement matrix and vn is the noise associated to the measurement process. The purpose of the Kalman filtering is to obtain a more stable position of the detected vehicles. Besides, oscillations in vehicle position due to the unevenness of the road make the v coordinate of the detected vehicles change several pixels up or down. This effect makes the distance estimation unstable, so a Kalman filter is necessary to minimize these kinds of oscillations.
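For illustration, a minimal Kalman filter of this kind can be written directly in a few lines of Python; here A is taken as the identity, one simple choice consistent with the near-constant-position assumption above, and the noise magnitudes are purely illustrative:

import numpy as np

class BoxKalmanFilter:
    """Kalman filter over the state [u, v, w, h] of a tracked bounding box."""
    def __init__(self, z0, model_noise=4.0, meas_noise=2.0):
        self.x = np.asarray(z0, dtype=float)      # state estimate
        self.P = np.eye(4) * 10.0                 # state covariance
        self.A = np.eye(4)                        # simple dynamics: near-constant position
        self.H = np.eye(4)                        # every state component is measured
        self.Q = np.eye(4) * model_noise          # model noise (illustrative magnitude)
        self.R = np.eye(4) * meas_noise           # measurement noise (illustrative magnitude)

    def predict(self):
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + self.R             # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x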
2.4. Error analysis
Accurate detection of the wheel-to-road contact point of the preceding vehicle is essential for assuring maximum precision of the host-to-vehicle estimated distance. Under the flat-road assumption the estimated distance is Z = fv hCAM/v, so the error Zerr committed in estimating the host-to-vehicle distance due to a vehicle detection error of n pixels in the image plane is given by:

$$Z_{err} = Z - \frac{f_v\, h_{CAM}}{v + n} = \frac{n\, Z^2}{f_v\, h_{CAM} + n\, Z} \qquad (7)$$

where v is the vertical coordinate of the wheel-to-road contact point in the image plane (measured with respect to the horizon line), Z is the estimated host-to-vehicle distance, fv is the vertical focal length in pixels and hCAM represents the elevation of the camera above the ground. Considering an error of one pixel (n = 1) and fv hCAM ≫ nZ, Zerr becomes:

$$Z_{err} \approx \frac{Z^2}{f_v\, h_{CAM}} \qquad (8)$$
For example, for a 320 × 240 image, a focal length fv = 370 px and a camera height hCAM = 1.2 m, an error of 1 pixel (n = 1) yields a relative error of 5% (Zerr/Z = Z/(fv hCAM) = 0.05) at a distance Z = 0.05 × 370 × 1.2 ≈ 22.2 m, whereas the error at 44.4 m is 10%. In Figure 14 we can see the depth accuracy due to quantization for different image resolutions. As can be seen, the larger the image resolution, the better the accuracy. Unfortunately, a trade-off must be reached between the accuracy of the depth measurements and the computational cost. In our case the size of the images is 320 × 240 pixels, which provides more than enough accuracy for automotive applications.
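For reference, Equation (8) can be evaluated directly; the following Python snippet reproduces the two operating points quoted above:

def range_error(Z, fv=370.0, h_cam=1.2, n=1):
    """Approximate host-to-vehicle distance error in metres, Equation (8),
    caused by an n-pixel error in the detected wheel-to-road contact point."""
    return n * Z**2 / (fv * h_cam)

# Relative error for a 1-pixel detection error at the distances quoted in the text:
for Z in (22.2, 44.4):
    print(f"Z = {Z:4.1f} m  ->  relative error = {100 * range_error(Z) / Z:.1f} %")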
The distance measurements are used to obtain the relative host-to-vehicle velocity. The relative velocity vH2V is computed using the following equation:

$$v_{H2V} = \frac{\Delta Z}{\Delta t} = \frac{Z - Z'}{\Delta t} \qquad (9)$$

Based on the scale change s of detected objects in the image plane, the optimal value of Δt that minimizes the estimation noise can be calculated. Let W denote the width (in meters) of the preceding vehicle, and w and w′ the width of the preceding vehicle in the image plane when it is located at distances Z and Z′, respectively, with regard to the host vehicle. Since w = fW/Z and w′ = fW/Z′, the scale change s can be defined as:

$$s = \frac{w' - w}{w'} \qquad (10)$$

Then, the estimated relative velocity can be computed as follows:

$$v_{H2V} = \frac{Z\, s}{\Delta t} \qquad (11)$$

As demonstrated in [15], the value of Δt that minimizes the error in the estimated relative velocity is given by:

$$\Delta t = \sqrt{\frac{2\, Z\, s_{err}}{a}} \qquad (12)$$

where a represents the acceleration of the host vehicle and serr is the error committed in the estimation of the scale change. Building on this result, the optimal value of Δt for zero acceleration is infinite. In practice, it has been limited to Δt = 1.0 s, which matches both the GPS and the CAN bus sample time (1 Hz).
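Putting Equations (10)–(12) together, the velocity estimation can be sketched as follows; the numeric example is illustrative:

import math

def relative_velocity(Z, w_prev, w_curr, dt=1.0):
    """Relative host-to-vehicle velocity (m/s, positive when closing), estimated from
    the scale change of the detected vehicle width in the image plane.
    Z is the host-to-vehicle distance at the start of the interval."""
    s = (w_curr - w_prev) / w_curr     # scale change, Equation (10)
    return Z * s / dt                  # Equation (11)

def optimal_dt(Z, s_err, a):
    """Sample time that minimizes the velocity estimation error, Equation (12)."""
    return math.sqrt(2.0 * Z * s_err / a) if a > 0 else float("inf")

# Example: vehicle at 30 m whose image width grows from 40 to 42 pixels in 1 s.
print(relative_velocity(30.0, 40, 42))   # ~1.43 m/s closing speed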
2.5. Traffic load and road speed
As depicted in Figure 2, the Traffic Data Collection module uses three sources of data: the measurements provided by the GPS, the data supplied by the CAN bus (vehicle speed) and the outputs obtained from both vision-based vehicle detection systems. Whereas the GPS and CAN bus sample frequency is 1 Hz, the vision-based system operates in real time at 25 frames per second (25 Hz). In order to obtain measurements from the GPS and the CAN bus at 25 Hz, we apply a linear interpolation between two consecutive samples.
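A simple way to carry out this resampling is to interpolate each 1 Hz signal linearly at the 25 Hz frame timestamps, for instance with numpy (the data below are illustrative):

import numpy as np

# 1 Hz GPS/CAN timestamps and speeds, and 25 Hz frame timestamps (illustrative data).
t_can = np.arange(0.0, 10.0, 1.0)
v_can = np.array([20.0, 20.5, 21.0, 21.2, 21.0, 20.8, 20.5, 20.3, 20.0, 19.8])
t_frames = np.arange(0.0, 9.0, 1.0 / 25.0)

# Linear interpolation between consecutive 1 Hz samples at each frame time.
v_frames = np.interp(t_frames, t_can, v_can)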
The outputs of the forward and rear vehicle detection systems at frame i are the number of detected vehicles Ni and their corresponding distances to the host vehicle d_i^k, with k = 1, …, Ni. These outputs are combined to cover the local environment of the vehicle. The traffic load at frame i is given by the next equation:

$$L_i = \frac{N_i}{N_{MAX}} \qquad (13)$$

where NMAX is the maximum number of vehicles in range that can be detected by both systems (in our case NMAX is defined as 8 or 12 for two-lane and three-lane roads, respectively). The average road speed at frame i is computed as follows:

$$\bar{v}_i = \frac{1}{N_i} \sum_{k=1}^{N_i} \left( v_i^{host} + \frac{d_i^k - d_{i-1}^k}{\Delta t} \right) \qquad (14)$$

where d_i^k and d_{i-1}^k represent the distance between the host vehicle and vehicle k at frames i and i − 1, respectively, Δt corresponds to the sample time (which is limited to 1.0 s as described in Section 2.4), v_i^host is the host vehicle speed provided by the CAN bus, and Ni is the number of detected vehicles. Note that the distance values correspond to filtered measurements, since they are obtained from the first two elements of the Kalman filter state vector (u and v) using known camera geometry and ground-plane constraints.
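For illustration, the per-frame aggregation of Equations (13) and (14) can be sketched as follows; it is assumed that the detections have already been associated between frames i − 1 and i, and all numeric values are illustrative:

def traffic_load(num_detected, n_max):
    """Traffic load at the current frame, Equation (13)."""
    return num_detected / n_max

def average_road_speed(d_curr, d_prev, v_host, dt=1.0):
    """Average absolute road speed, Equation (14): host speed plus the rate of change
    of the host-to-vehicle distance of every tracked vehicle (falls back to the host
    speed when no vehicles are detected)."""
    speeds = [v_host + (dc - dp) / dt for dc, dp in zip(d_curr, d_prev)]
    return sum(speeds) / len(speeds) if speeds else v_host

# Example: three tracked vehicles on a two-lane road (NMAX = 8).
print(traffic_load(3, 8))                                                  # 0.375
print(average_road_speed([22.0, 35.5, 12.0], [21.0, 36.0, 12.5], v_host=22.2))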