1. Introduction
We live in a constantly developing world, where cities are becoming increasingly crowded and the number of cars keeps growing. Traffic becomes more and more congested because the road infrastructure can no longer cope with the increasing number of vehicles. This means more fuel consumption, more pollution, longer journeys, stressed drivers, and, most importantly, an increase in the number of accidents. Pedestrians, cyclists, and motorcyclists are the most exposed to road accidents. According to a 2017 report by the World Health Organization, every year 1.25 million people die in road accidents, and millions more are injured [1]. The latest status report, from 2023 [2], indicates a slight decrease in the number of road traffic deaths, to 1.19 million per year, highlighting the positive impact of efforts to enhance road safety. However, it underscores that the cost of mobility remains unacceptably high. The study described in [3] tracked the progress made since 2010 in reducing the number of car accidents in several cities. It concluded that very few of the studied cities are improving road safety at a pace that will reduce road deaths by 50% by 2030, in line with the United Nations' road safety targets.
Autonomous vehicles can avoid some errors made by human drivers, and they can improve the flow of traffic by controlling their pace so that traffic stops oscillating. They are equipped with advanced technologies such as global positioning systems (GPS), video cameras, radars, light detection and ranging (LiDAR) sensors, and many other types of sensors. They can travel together, exchanging information about travel intentions, detected hazards and obstacles, etc., through vehicle-to-vehicle (V2V) or vehicle-to-everything (V2X) communication networks.
To increase the efficiency of road traffic, the idea of grouping autonomous vehicles into platoons through information exchange was proposed in [4]. Vehicles should consider all available lanes on a given road sector when forming a group and travel at high speeds with minimal safety distances between them. However, this is possible only if a vehicle can determine its precise position with respect to other traffic participants.
In recent years, image processing and computer vision techniques have been widely applied to solve various real-world problems related to traffic management, surveillance, and autonomous driving. In particular, the detection of traffic participants such as vehicles, pedestrians, and bicycles [5] plays a crucial role in many advanced driver assistance systems (ADAS) and smart transportation applications.
Image processing plays an important role in traffic optimization, as it is used to develop functionalities that reduce the number of accidents, increase traffic comfort, and group vehicles into platoons. Several approaches have been proposed over time to detect traffic participants with video cameras using convolutional neural networks [6,7], but most of them require a significant amount of computing power and cannot be used in real time due to their high latency.
In this paper, a proof-of-concept algorithm to solve a part of the image-based vehicle platooning problem is proposed. It uses a decentralized approach, with each vehicle performing its own computing steps and determining its position with respect to nearby vehicles. This approach relies on images acquired by the vehicle's cameras and on the communication between vehicles. To test our approach, we used cheap commercial dashboard cameras equipped with a GPS sensor. No other sensors were used, mainly because they would have greatly increased the hardware cost. Each vehicle computes an image descriptor for every frame in its video stream, which it sends as a message, along with GPS information, to other vehicles. Vehicles within communication range receive this message and attempt to find the frame in their own stream that most closely resembles the received one. The novelty of this approach lies in calculating the distance between two vehicles by matching image descriptors computed for frames from both vehicles, determining the time difference at which the two frames were captured, and considering the traveling speeds of the vehicles.
The rest of the paper is organized as follows. Section 2 presents some vehicle grouping methods for traffic optimization, then reviews applications of image processing related to street scenes, and, lastly, presents several image descriptors. The method proposed in this paper is detailed in Section 3, while Section 4 presents the implementation of the algorithm. In Section 5, preliminary results are presented, demonstrating the feasibility of the proposed algorithm. Finally, Section 6 presents the main conclusions of this study and directions for future research.
3. Precise Localization Algorithm
In this paper, an algorithm is proposed to help vehicles position themselves with respect to other nearby vehicles. The approximate distance between two nearby vehicles is computed using GPS data. The exact relative position of two vehicles cannot be computed using only GPS data because all commercial GPS devices have an inherent positioning error [37]. This error might increase further under various conditions, such as a low number of visible satellites, nearby buildings or trees, driving through tunnels, etc. For example, when using a smartphone, the GPS error can be, on average, as large as 4.9 m [38]. Such errors can lead to potentially dangerous situations if a relative vehicle positioning system relies solely on GPS data. For this reason, the aim of this paper is to increase the positioning accuracy using images captured by cameras mounted on each vehicle. Thus, the proposed solution aims to find two similar frames from different vehicles within a certain distance range. Each vehicle sends information about multiple consecutive frames while also receiving similar information from other vehicles for local processing. When the image descriptors calculated for these frames are matched, a high number of matches indicates that the two vehicles passed through approximately the same position. Using the timestamps associated with the two frames, we can determine the moment each vehicle was in that position, allowing us to calculate the distance between them by considering their traveling speeds and the time difference between the two frames.
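As a minimal illustration of this matching step, the following Python sketch (all function and variable names are ours, not taken from the paper's implementation) searches a buffer of locally computed binary descriptors for the frame that best matches a received one:

```python
import cv2

def best_matching_frame(received_desc, own_frames, ratio=0.8):
    """Find the own frame whose descriptors best match a received frame.

    own_frames: list of (timestamp, frame_no, speed_mps, descriptors) tuples,
    where descriptors are binary (e.g., ORB or BEBLID) uint8 numpy arrays.
    """
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # Hamming norm for binary descriptors
    best_idx, best_score = None, 0
    for i, (_, _, _, desc) in enumerate(own_frames):
        pairs = matcher.knnMatch(received_desc, desc, k=2)
        # Lowe's ratio test keeps only distinctive matches
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) > best_score:
            best_idx, best_score = i, len(good)
    return best_idx, best_score  # index of the best frame and its match count
```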
The proposed approach is decentralized, meaning that each vehicle acts as an independent entity. It has to handle the information exchange with the other vehicles as well as the processing of the self-acquired and received data. In our model, vehicles employ a V2X communication system in which information is broadcast, but they will also use a V2V communication model if the distance to a responding nearby vehicle is below a pre-defined threshold (Figure 1). Each vehicle broadcasts processed information without knowing whether any other vehicle will receive it.
The proposed algorithm assumes that each vehicle is equipped with an onboard camera with GPS and a computing unit. The GPS indicates the current position of the vehicle in terms of latitude and longitude, as well as the corresponding timestamp and the vehicle's speed. The vehicle's computing unit handles all computations and communications, so it processes the data and sends it to the other vehicles. The processing unit also receives data from other vehicles that are nearby. Based on the received information, the processing unit must determine whether a V2V communication can start between the two vehicles. If it can, it begins an information exchange with the other vehicle and processes the subsequently received data. This means that each vehicle has two roles: the first involves data processing and communication, while the second involves receiving messages from other nearby vehicles and analyzing them. As the paper does not focus on the communication model itself but rather on the image processing part, we employed a very simple and straightforward communication model. This model cannot be used in real-world applications, where security, compression, and other factors must be taken into consideration. Our main focus when developing the communication model was on the main processing steps needed from the image processing point of view. The send and receive roles are described in the following subsection.
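For concreteness, the broadcast message and the V2V range check could be sketched as follows; the field names and the 100 m threshold are our own illustrative assumptions, not values from the paper:

```python
import math
from dataclasses import dataclass

V2V_RANGE_M = 100.0  # illustrative V2V threshold, not a value from the paper

@dataclass
class FrameMessage:
    vehicle_id: str
    timestamp: float    # GPS timestamp of the frame, in seconds
    frame_no: int       # frame index (1..30) within the current GPS second
    lat: float          # latitude, decimal degrees
    lon: float          # longitude, decimal degrees
    speed_mps: float    # vehicle speed reported by the GPS, in m/s
    descriptors: bytes  # serialized image descriptors for this frame

def gps_distance_m(lat1, lon1, lat2, lon2):
    """Approximate haversine distance between two GPS fixes, in meters."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def should_start_v2v(own: FrameMessage, received: FrameMessage) -> bool:
    """Switch from broadcast (V2X) to direct V2V below the distance threshold."""
    return gps_distance_m(own.lat, own.lon, received.lat, received.lon) < V2V_RANGE_M
```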
4. Framework for Algorithm Implementation
This section details the implementation of the algorithm, which involves several steps: resizing the image, extracting the camera-displayed time, simulating the communication process, detecting the corresponding frame, defining the distance calculation formula, and outlining the hardware equipment used.
4.5. Compute Distance
After detecting the corresponding frame in the video stream, the next step involves computing the distance between vehicles and determining their positions. This can be achieved using information from the two vehicles associated with these frames, such as the timestamp and speed.
Thus, if the timestamp from the current vehicle is greater than that of the other vehicle, it indicates that the latter is in front of the current one. The approach to comparing the two timestamps is presented in Figure 6. By knowing the speed of the front vehicle and the time difference between the two vehicles, the distance between them is computed.
Conversely, if the timestamp from the current vehicle is smaller than that of the other vehicle, it suggests that the latter is behind the current one. With the speed of the current vehicle and the time difference between the two vehicles, the distance between them can still be computed.
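This direction test and the basic distance estimate can be expressed as follows (a sketch with our own naming; the refined per-second computation is described below):

```python
def relative_position(own_ts, own_speed_mps, other_ts, other_speed_mps):
    """Decide who is in front and estimate the gap from matched-frame timestamps.

    A greater own timestamp means the other vehicle passed the matched spot
    earlier, so it is in front; the gap is its speed times the time difference.
    """
    dt = abs(own_ts - other_ts)
    if own_ts > other_ts:
        return "other_in_front", other_speed_mps * dt
    return "other_behind", own_speed_mps * dt
```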
Given that the video operates at a frequency of 30 frames per second and GPS data is reported every second, each of the 30 frames contains the same set of information. However, this uniformity prevents the exact determination of distance because both frame 1 and frame 30 will have the same timestamp despite an almost 1-s difference between the two frames.
To enhance the accuracy of distance computation between the two vehicles, adjustments are made to the timestamp for the frames from both vehicles. In addition to other frame details, the frame number reported with the same timestamp (ranging from 1 to 30) is transmitted. In the distance computation function, the timestamp is adjusted by adding the current frame number divided by the total number of frames (30). For instance, if the frame number is 15, 0.5 s are added to the timestamp.
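This sub-second correction amounts to a single line (a sketch, assuming frame numbers from 1 to 30 within each GPS second, as described above):

```python
FPS = 30  # camera frame rate; the GPS data refreshes only once per second

def adjusted_timestamp(gps_timestamp, frame_no):
    """Refine the once-per-second GPS timestamp using the frame index (1..30)."""
    return gps_timestamp + frame_no / FPS  # e.g., frame 15 adds 0.5 s
```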
In Figure 7, the method of computing the distance, assuming that Vehicle 1 is in front and Vehicle 2 is behind, is detailed. Frame V1 from Vehicle 1, which is the x-th frame at timestamp $T_1$, is detected by Vehicle 2 as matching frame V2, which is the y-th frame at timestamp $T_2$. To determine its position relative to Vehicle 1, the other vehicle needs to compute the distance traveled by the first vehicle in the time interval from timestamp $T_1$ to the current timestamp $T_2$, taking into account its speed.
To compute the distance as accurately as possible, the speed reported at each timestamp is considered, and the calculation formula is presented in Equation (16). Since frame V1 is the x-th frame at timestamp $T_1$, and considering that there are 30 frames per second, the time remaining until timestamp $T_1 + 1$ can be determined. Then, this time interval is multiplied by the speed $v_{T_1}$ reported at timestamp $T_1$ to determine the distance traveled in this interval. The distance traveled from timestamp $T_1 + 1$ to $T_2 - 1$ is determined by multiplying the speed $v_i$ reported at each of these timestamps by 1 s. To determine the distance traveled from $T_2$ to the y-th frame, the speed $v_{T_2}$ at $T_2$ is multiplied by $y/30$ s. By summing all these distances, the total distance is obtained.
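Based on this description, Equation (16) should have the following form (our reconstruction from the text above; $v_i$ denotes the speed reported at timestamp $i$, in m/s, and the time factors are in seconds, giving $d$ in meters):

```latex
d = \underbrace{\frac{30 - x}{30}\, v_{T_1}}_{\text{rest of second } T_1}
  + \underbrace{\sum_{i = T_1 + 1}^{T_2 - 1} v_i}_{\text{full seconds in between}}
  + \underbrace{\frac{y}{30}\, v_{T_2}}_{\text{fraction of second } T_2}
```

The three terms together span exactly the interval $T_2 - T_1 + (y - x)/30$ s between the two matched frames.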
5. Performed Experiments and Test Results
Based on the implementation presented in the previous section, a series of tests was conducted to demonstrate both the feasibility of the algorithm and its performance. The performance of the BEBLID, ORB, and SIFT descriptors was tested, as well as how the number of features influences frame detection. Finally, a comparison between the distance calculated by the proposed algorithm, the one calculated based on GPS data, and the measured distance is presented to illustrate the precision of the proposed algorithm in real-world applications. This comparison shows that the algorithm achieves a high degree of accuracy when validated against physically measured distances, which demonstrates its potential effectiveness for applications requiring precise distance calculations, e.g., vehicle platooning.
5.2. Descriptor Performance Test
The underlying concept of this test involved the deployment of two vehicles equipped with dashboard cameras driving on the same street. As they progressed, the cameras recorded footage, resulting in two distinct videos.
Using these video recordings, the objective of the test was to identify 50 consecutive frames from the leading vehicle within the footage captured by the trailing vehicle. Each frame from the first video was compared with 50 frames from the second video, and the frame with the highest number of matches was selected. This aimed to ascertain the algorithm's capability to consistently detect successive frames, thus showcasing its robustness. Furthermore, a secondary aim was to evaluate the performance of the three descriptors used in the process. In Figure 10a, one of the frames from the car in front (left) and the matched frame from the rear car (right) are presented. In the frame on the right, the front car is also visible.
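The three descriptors can be instantiated in OpenCV roughly as follows (a minimal sketch using opencv-contrib-python; the 20,000-feature budget follows Table 1, while the BEBLID scale factor of 1.0, suggested for ORB keypoints, is our assumption):

```python
import cv2

N_FEATURES = 20000  # feature budget used in this test (Table 1)

orb = cv2.ORB_create(nfeatures=N_FEATURES)
sift = cv2.SIFT_create(nfeatures=N_FEATURES)
# BEBLID only describes keypoints, so it reuses ORB's detections;
# a scale factor of 1.0 is suggested for ORB keypoints.
beblid = cv2.xfeatures2d.BEBLID_create(1.0)

def describe(gray, method):
    """Return (keypoints, descriptors) for one grayscale frame."""
    if method == "beblid":
        keypoints = orb.detect(gray, None)
        return beblid.compute(gray, keypoints)
    detector = orb if method == "orb" else sift
    return detector.detectAndCompute(gray, None)
```

Note that SIFT produces floating-point descriptors, so its matching uses the L2 norm rather than the Hamming norm used for the two binary descriptors.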
In Table 1, the results of the three descriptors for a total of 20,000 features are presented. BEBLID correctly detected 39 frames, with the incorrect detections shown in blue in the table. These incorrect detections deviate only slightly, detecting a frame either immediately preceding or following the actual one, which poses no significant concern.
Note that the frames shown in blue in Table 1 might be caused by the fact that a perfect synchronization between the frames of the two videos cannot be accomplished. For example, the first car, traveling at 21.9 km/h (6.083 m/s), records a frame every 0.2 m (considering 30 frames per second). This sampling rate might, in our opinion, cause some of the slightly incorrect detections shown in blue in Table 1. This is why these cases are marked in blue: they might result from the sampling rate of the cameras rather than from an actual error in the matching algorithm.
Similarly, ORB shows good performance, correctly detecting 38 frames. However, the performance of SIFT falls short of expectations, with only 23 out of 50 frames being detected accurately. Additionally, for SIFT, there are cases where an earlier frame was detected after a later frame had already been detected; these are indicated in red in the table. Moreover, the cases where the difference between the correct frame and the predicted one is greater than one frame are highlighted in orange. Another downside of using SIFT is that it has more cases with three consecutive detections of the same frame than the other two descriptors (BEBLID: 0; ORB: 1, at frame 45; SIFT: 4, at frames 55, 63, 71, and 73). Also, when using SIFT, there are cases where two consecutively detected frames differ by three frames (frames 29, 58, 66, and 71), a situation that was not encountered when using BEBLID or ORB.
Furthermore, it is worth highlighting that a higher number of matches, as observed in the case of SIFT, does not necessarily translate into better performance. Despite BEBLID having a lower number of matches than the other two descriptors, it achieved the highest performance in this test.
These findings underscore the importance of not only relying on the quantity of matches but also considering the accuracy and robustness of the detection algorithm. In this context, BEBLID stands out as a promising descriptor for its ability to deliver reliable performance even with a comparatively lower number of matches.
It is worth mentioning that the frames marked in orange and red most likely correspond to errors in the detection algorithm and could lead to potentially dangerous situations if their number increased.