1. Introduction
In an era characterized by unprecedented technological advancements, the integration of Unmanned Aerial Vehicles (UAVs) has emerged as a transformative force across various industries. One particularly promising application lies in the domain of road monitoring and surveillance, where UAVs can be used for vehicle detection and tracking, redefining the capabilities and efficiency of traditional monitoring systems.
Traditional road monitoring systems, although effective to a certain extent, are often limited by factors such as coverage, cost, and real-time responsiveness. UAV swarms, with their ability to operate autonomously and collaboratively, offer a paradigm shift in the way we perceive and implement road surveillance. The scalability, flexibility, and adaptability of these swarms present a novel solution to address the dynamic and evolving nature of modern road networks.
The convergence of cutting-edge technologies, such as artificial intelligence, sensor miniaturization, and advanced communication systems, has paved the way for the development of UAV swarms capable of collaborative and intelligent operations. However, deploying UAVs in real-world scenarios involves significant risks and challenges, primarily related to safety and security. Therefore, robust and reliable systems must be developed and rigorously tested before real-world implementation.
This paper explores the potential of harnessing UAV swarms for road monitoring and surveillance, using the UAVs to perform vehicle detection and tracking tasks through computer vision technologies and advanced simulation frameworks. This proposal aims to develop a real-time surveillance system capable of detecting ground vehicles within a designated area. A UAV swarm is an entity formed by more than one UAV that can perform coordinated tasks to solve a problem [1]; the coordination implies the execution of a shared mission, either performing common or distinct tasks. Swarm approaches like the implementations of [2,3] have shown extensive operational capacity. In this study, the use of swarms offers the advantage of extended area coverage and provides multiple perspectives for data acquisition. In addition, the swarm allows for the use of collaborative search and tracking techniques [4,5,6]. All of this implies that the additional data collected by the swarm can enhance the detection capabilities compared to a single UAV.
Two fundamental elements are key to implementing the proposed surveillance system: first, target detection, where sensor data are used to identify and locate objects of interest; second, tracking the detected targets over time. Improving detection accuracy directly enhances tracking performance, making advanced detection algorithms ideal for such systems. However, advanced detectors typically require additional computational resources, increasing the time needed for tracking.
For the detection, this study relies on the use of computer vision technology [7] through cameras mounted on Unmanned Aerial Vehicles (UAVs). The study of UAV-based surveillance systems in the literature shows two main approaches when applying computer vision models [1,8]. First, one-stage models perform detection and classification in a single step, offering the advantage of reduced processing time, which is critical for real-time applications. Second, two-stage models separate the detection and identification tasks, potentially improving accuracy but at the cost of increased computation time.
The speed provided by one-stage models is ideal for the desired tracking task, but their lower detection accuracy can produce underperforming tracking within the surveillance system. One way to counteract this effect is to use distributed surveillance systems that, through data fusion, allow less accurate detections to still achieve a good tracking output.
The primary goal of the proposed swarm-based system is to distribute the surveillance tasks between the different UAVs of the swarm and to combine faster detection algorithms with the fusion system, producing a tracking output whose quality is comparable to that of a slower detector on a single UAV.
To accomplish this goal, it is necessary to evaluate two main factors. On the one hand, the detection algorithms should provide enough quality when performing the real-time tracking problem. On the other hand, the swarm configuration and the integration of multiple UAVs to monitor the desired area should be evaluated to determine how the addition of new UAVs can improve the tracking performance.
For the swarm configuration, the proposal presented in this paper is to use a centralized control station to allow UAV coordination and common processing of the captured data [1]. This centralized approach facilitates the use of data fusion techniques to integrate information from multiple UAVs, thereby enhancing the overall surveillance output. These swarm capabilities will be evaluated with a designed set of scenarios that assess how adding new UAVs to the system generates additional sources of information.
To evaluate the performance of the system, it is necessary to implement a one-stage algorithm that can operate on a real-time problem. The You Only Look Once (YOLO) model is one of the main representatives of one-stage models, widely used in the literature for its real-time detection capabilities and its balance between speed and accuracy across different computer vision problems. For all these reasons, it has been selected to test and evaluate the swarm configuration and the improvements produced by including more sensors in the system.
As an initial experiment, this study aims to evaluate the challenges a high-precision, two-stage model faces when applied in a real-time context. To test this approach, we utilize the Segment Anything Model (SAM), a recent advancement in image segmentation known for its strong generalization across various applications. However, as a segmentation-only model, SAM requires an additional stage to perform vehicle identification, which is essential for tracking. By comparing the performance of both the YOLO model and SAM under identical conditions, it is possible to determine whether a high-precision model is a suitable detection algorithm for the proposed surveillance task.
The results show that, as expected, YOLO outperforms SAM in speed, making it the better approach for real-time applications. Building on the results of the detection algorithm evaluation, an in-depth analysis has been performed on the YOLO architecture to evaluate the different configurations of the model inside the swarm system.
To evaluate the proposed system, the proposal is to consider both the computation time, as a measure of the real-time needs of the problem, and an accuracy metric based on the Multi Object Tracking Accuracy (MOTA) metric [9], capable of evaluating the tracking problem. The combination of both metrics can provide insights into the trade-offs between real-time performance and detection precision.
All testing of the system was conducted within a highly realistic virtual environment using a Software-In-The-Loop (SITL) testing strategy. This approach ensures safety while allowing for extensive experimentation, eliminating the risks associated with real-world testing. The primary objective of this study is to validate the correct application of algorithms and problem-solving approaches before real-world deployment. Although the simulation and experimental setup aim to closely replicate real-world conditions, they are specifically designed to provide a safe and controlled testing environment.
In summary, the key proposals for the current study are as follows:
Using UAV Swarms for Enhanced Road Surveillance: UAV swarms are proposed as an alternative to traditional road monitoring systems, offering improved coverage, flexibility, and responsiveness. These swarms are expected to provide a more scalable and adaptable solution for monitoring dynamic road networks.
Improved One-Stage Real-Time Detection and Tracking Using Computer Vision: The system uses UAVs with mounted cameras to detect and track vehicles in real time. With the use of the swarm, one-stage models are enhanced to obtain high accuracy with short processing times.
Centralized Control and Data Fusion for Swarm Coordination: The UAV swarm system employs a centralized control station to coordinate the UAVs and consolidate data. This centralized approach facilitates the fusion of data from multiple UAVs, enhancing tracking accuracy through distributed sensing and collaborative analysis.
To test these proposals, the paper presents an evaluation using a virtual environment with Software-In-The-Loop (SITL) simulation to safely evaluate the proposed system. A set of scenarios replicating real-world situations is then processed and analyzed, comparing different YOLO configurations and the RT-DETR algorithm to evaluate their real-time computation performance. The system’s performance is assessed using a combination of computation time (for real-time capability) and Multi Object Tracking Accuracy (MOTA) to gauge the balance between real-time responsiveness and tracking accuracy.
This study is organized as follows: Section 2 presents a study of the literature related to road surveillance, evaluating the most used algorithms and delving into the problem from the UAV perspective, looking also into swarm capabilities. Section 3 presents the proposed system for the UAV-based tracking and surveillance problem, detailing the needed simulation components. Finally, Section 4 details the experimentation with the proposed system, while Section 5 provides the study conclusions.
3. System
The proposed UAV-based surveillance system consists of two main subsystems. The first subsystem is the swarm configuration and UAV control, which manages the deployment of UAVs. The second subsystem comprises the computer vision algorithms used for detecting and tracking ground vehicles. The aim is to integrate both subsystems to create a more robust and efficient solution for road surveillance.
The swarm approach is designed to leverage the additional sensors provided by the inclusion of multiple UAVs. These additional sensors enable the capture of more comprehensive data, which can be used to enhance the tracking of ground vehicles. Moreover, a swarm configuration offers several advantages over a single UAV, including extended area coverage, increased redundancy, and the reduction of blind spots caused by obstacles within the mission environment.
In addition, for the swarm control, it is important to configure the simulation to replicate the control performed in a real-world scenario. The use of UAVs in the real world is highly restricted due to the inherent risks associated with the potential misuse of this technology. Consequently, the safety and security of the operations that involve UAVs must be guaranteed before the implementation of a real-world system. To achieve this, the proposal is to configure a realistic simulator on which the UAVs will behave as expected in the real world.
The car detection and tracking proposal implies the use of cameras mounted on each UAV and the application of a computer vision approach to perform ground vehicle detection. The implemented solution allows for the testing of different computer vision algorithms and their corresponding configurations. These algorithms must be adapted for the detection of ground vehicles from aerial images captured by UAVs. Then, the tracking identifies and follows the vehicles over time, maintaining a consistent vehicle ID across multiple detections and generating trajectories that can be utilized in the future decision-making processes of the proposed surveillance system.
Apart from the two main components, it is important to remark on the need for a trajectory data fusion system to unify all information provided by the different UAVs into a single tracking output.
3.1. Swarm System Overview
The proposal for this paper is to use a virtual simulation that replicates the complete behavior of the real-world system. By employing such simulations, it becomes feasible to develop and thoroughly test the system’s functionalities and operational efficacy in a controlled environment prior to deployment in actual operational settings.
Figure 1 presents an example of the images captured by different swarm members.
The system proposed in this study is implemented over a framework that has been used in previous works [1]. The proposal is to perform a SITL simulation, which includes a simulation environment configurable with multiple ground vehicles that can be used as targets for the detection and tracking functionalities of the proposed system. The simulation is achieved using the AirSim framework applied over Unreal Engine to achieve photorealism and realistic UAV physics. This framework has been chosen for the high-fidelity images that can be produced with the Unreal Engine software, making it one of the best options for UAV simulators, specifically for systems designed around computer vision approaches.
The AirSim framework is also prepared for multiple UAV simulations, thus allowing for the swarm configuration. To implement each UAV’s control, the proposal is to use the PX4 flight control software; this software is used inside real-world flight controllers that can be mounted onto UAVs. Consequently, the control logic governing the simulated UAVs mirrors that of their real-world counterparts, thereby facilitating future extensions of the proposed system. In addition, the PX4 flight control software allows for missions with a wide spectrum of configurations, supporting multiple UAVs, different sensors, and even the option of HITL simulations in future developments. With this software, it is possible to execute complex flight missions on the UAV, including communication with the control station.
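As an illustration, the sketch below generates an AirSim settings.json declaring a two-UAV PX4 SITL setup of this kind. The vehicle names, ports, and spawn offsets are assumptions for this example, not the exact configuration used in the study.

```python
# Minimal sketch: generating an AirSim settings.json for a two-UAV PX4 SITL
# swarm. Each PX4 SITL instance must listen on its own TCP port.
import json
from pathlib import Path

settings = {
    "SettingsVersion": 1.2,
    "SimMode": "Multirotor",
    "Vehicles": {
        f"UAV{i}": {
            "VehicleType": "PX4Multirotor",
            "UseSerial": False,
            "UseTcp": True,
            "TcpPort": 4560 + i,          # one PX4 SITL instance per port
            "X": 0, "Y": 4 * i, "Z": 0,   # assumed spawn offsets in metres
        }
        for i in range(2)
    },
}

# AirSim reads this file from ~/Documents/AirSim/settings.json by default.
Path("settings.json").write_text(json.dumps(settings, indent=2))
```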
A UAV swarm mission can be configured as either a centralized or decentralized system [1]. In the former, all communications occur between the UAVs and a control station, and all decisions are made by the control station; in the latter, the UAVs can communicate information to other UAVs within the system and can operate autonomously, coordinating themselves to make decisions without the intervention of the control station. In this context, a centralized control station offers advantages in terms of precise positioning, coordination of the swarm, and the application of data fusion techniques to generate a unified tracking output for all UAVs in the system. Conversely, decentralized missions provide autonomy to mission agents, automatic configuration, and response to UAV failures, enabling autonomous repositioning and replanning.
The centralized swarm approach has the advantages of more efficient mission control, as the data are processed at a single point; a simplified UAV mission design, as all systems behave equally; and reduced communication requirements, as there is no data exchange between UAVs. On the other hand, the centralized control station introduces a single point of failure that could endanger the entire mission if the control station fails, reduced autonomy of the UAVs, and restricted scalability as the load of controlling a large number of UAVs increases.
Decentralized swarms also imply coordination difficulties, as synchronizing the actions of all swarm members can be complicated with large swarms. Within the study presented in this paper, the proposal is to apply a centralized swarm due to the robustness of UAV coordination and the efficient mission control that allows fusing all data at a single point. However, future work could explore the use of a decentralized approach to enable a more autonomous system capable of handling multiple tasks performed by different UAVs, and to increase problem scalability, allowing large quantities of UAVs to be introduced in complex scenarios. This approach would especially help in situations with high-density traffic, in which the limitations of fewer UAVs are more apparent.
The centralized control of the swarm is facilitated through a ground control station module, which is responsible for task allocation to each individual UAV. For the proposed problem, each UAV’s tasks include positioning at a desired location and an image recovery process. For this, the control station must consider all UAVs’ positions and movements to avoid collisions and achieve a safe mission. In addition, the control station can also apply tracking and data fusion algorithms to obtain an improved output for the surveillance problem.
Figure 2 presents an overview of the mission to be performed for the surveillance problem: the control station should first launch the control and communication threads for each UAV, then plan the positions at which each UAV introduced in the swarm must be placed, and assign the corresponding movement mission to the different UAVs. These movement missions must be configured to avoid collisions between different UAVs and should take the current scenario environment into account. It is important to note that the same mission can be modified depending on the number of UAVs to be used, selecting different positions to acquire extended coverage of the monitored area.
Once the mission is started, each UAV will autonomously undertake the designated movements, starting with the takeoff and navigation toward a predetermined target position. Upon reaching this position, the UAV enters a standby state, awaiting directives from the ground control station to initiate road surveillance.
Once all UAVs are positioned, the control station can command the start of the data acquisition phase, initiating the road surveillance. During this phase, each UAV orients its camera towards the specified measuring position and captures images at a pre-defined frequency.
The information obtained by each UAV consists of the captured images alongside metadata such as the capture time, the UAV’s spatial coordinates, and the camera’s yaw, pitch, and roll parameters. With this information, the tracking system is able to detect vehicles in the images and assign precise positions to them. By continuously tracking the vehicles across multiple consecutive frames, the system can identify and follow their movements along the road.
In addition, the use of data fusion techniques on the individual tracking outputs allows the system to generate a consolidated trajectory for each detected vehicle. The quality of these trajectories is expected to improve with the inclusion of additional UAVs in the system, as the increased data from the swarm enhances the accuracy and overall performance of the surveillance task.
The resulting trajectories furnish essential information for the proposed surveillance problem. In future work, a decision-making module can be implemented to take advantage of this information and assign new tasks to the UAVs. For example, it is possible to send part of the UAV swarm to new positions or to start following a specific vehicle for a more complex surveillance mission. In addition, with the current output, it is possible to send computed information to a human operator that can take action depending on the surveillance results.
In addition to this decision-making step, the ground control station should also finish the mission. To do so, once the road surveillance task is accomplished, the control station will start the ending process, sending a new movement mission for each UAV to return to the specified home location and land.
UAV Control and Simulation
For controlling the actions of each UAV, the proposed system uses the MAVSDK control software, version 2.5. This tool serves as the control station for the system, facilitating communication between the PX4 flight control software (version 1.14) and the AirSim simulation environment (version 1.8.1).
Renowned for its widespread adoption in UAV control applications, MAVSDK is aligned with the PX4 software and can be used inside on-board computers, thereby enabling fully autonomous UAV missions. Consequently, it stands as an optimal choice for the proposed experimentation, with potential expansions envisaged in future HITL or VITL implementations.
This software enables the operation of UAVs by executing command and control operations, as well as obtaining mission parameters such as UAV waypoints, movement control directives, and mission duration specifications. Additionally, it allows for configuring the image capture process, including setting the desired capture frequency and saving essential data for the surveillance task, such as images and metadata, including the image capture locations.
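A minimal sketch of the positioning phase using MAVSDK-Python is shown below; the connection address, coordinates, and fixed takeoff wait are illustrative assumptions, and telemetry checks and error handling are omitted.

```python
# Sketch of the positioning phase with MAVSDK-Python.
import asyncio
from mavsdk import System

async def position_uav(url: str, lat: float, lon: float, alt_amsl: float):
    drone = System()
    await drone.connect(system_address=url)

    # Wait until the UAV reports a valid connection.
    async for state in drone.core.connection_state():
        if state.is_connected:
            break

    await drone.action.arm()
    await drone.action.takeoff()
    await asyncio.sleep(10)  # crude wait for the takeoff to complete

    # Move to the assigned surveillance position and hold there.
    await drone.action.goto_location(lat, lon, alt_amsl, 0.0)

# One control task per swarm member, each PX4 instance on its own port.
asyncio.run(position_uav("udp://:14540", 47.3977, 8.5456, 520.0))
```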
Using real-world control software for the UAV and the ground control station, the proposed simulation environment is configured to ensure a level of realism comparable to real-world scenarios. This allows for safe and secure SITL testing of all the implementations proposed within the purview of this study. However, it is important to take into account that the simulation does not include all real-world elements; future work will need to evaluate the problem with new tests using HITL and VITL approaches to address problems not considered in this initial approach.
To provide an overview of the implemented simulation architecture, Figure 3 presents the different modules within the system and the links existing between them. The simulation computer serves as the focal point and encompasses the configuration of the AirSim framework as the UAV simulator, alongside the Unreal Engine serving as the 3D engine for physics and image rendering. Given the initial design as a SITL system, the PX4 software is also integrated into this computer. However, for future iterations, the flight controller software module will need to be transitioned to an external hardware flight controller.
It is important to note that the ground control station is presented as a separate module from the simulation; however, both subsystems can be implemented on the same computer for Software-In-The-Loop (SITL) testing. Alternatively, the control station software can also operate on a different connected hardware platform.
Within this ground control station, the mission implementation is based on MAVSDK. The mission, as previously described, encompasses the positioning and data acquisition processes. In addition, the control station can operate the tracking module using the outputs provided by the simulated sensors. This tracking module can be applied as a real-time or offline solution, depending on the immediate or delayed processing of the images. The proposed surveillance problem should be applied as real-time processing to perform road monitoring. For future improvements to the system, it is possible to include a decision-making process to modify the UAV mission by assigning new tasks to be performed in response to the surveillance output. More examples of the implemented code for UAV navigation based on the PX4 architecture are shown in the SIMBAT project, with proofs of concept of UAV-based solutions in related areas [81].
3.2. Vehicle Detection
The literature on vehicle detection based on computer vision presents two primary approaches: employing a single algorithm to handle both detection and identification in a unified stage, or utilizing multiple algorithms to separate detection and identification. The use of different stages has the objective of improving the accuracy of the detection, although it implies additional computation time to perform the different phases.
The system presented in this paper can accommodate either approach, as the detection algorithm can function separately from the tracking module, allowing for the integration of various algorithms within the detection and tracking module. However, the proposal for this study is to demonstrate how a faster algorithm can achieve results equivalent to those of a more precise algorithm through the use of multiple UAVs and a fusion system.
For this study, the YOLO algorithm has been selected as an example of a one-stage algorithm. YOLO models are recognized as widely adopted and extensively tested algorithms in the literature, making them among the most commonly used solutions for various computer vision problems, particularly in detection and tracking applications. In relation to the proposed problem, these models have demonstrated commendable performance across diverse domains, with the primary advantage of short processing times for the tasks at hand. While the algorithm primarily focuses on object detection and classification, YOLO models also exhibit proficiency in segmentation. The API provided with version 8 of this algorithm (YOLOv8) includes connectivity to other applications, such as segmentation models and multi-object tracking, making it a useful option for enhancing this study with additional improvements to the overall tracking system.
Using the API provided with YOLO, it is possible to easily link a segmentation model that transforms the one-stage implementation of the detection model into a two-stage configuration. SAM is an emerging algorithm that presents a robust segmentation approach without requiring specific training for class recognition, but it has the disadvantage of needing a second-stage algorithm to perform the identification step, classifying all the vehicles detected in an image. By using this model alongside the YOLO classification options, it is possible to generate a robust two-stage implementation against which the one-stage proposal is compared.
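As a sketch, and assuming the Ultralytics Python API with standard pre-trained weights, the two setups can be instantiated as follows; the weight files and image path are placeholders, not the exact models used in the study.

```python
# Minimal sketch of the one-stage vs. two-stage setup via the Ultralytics API.
from ultralytics import YOLO, SAM

image = "uav_frame.jpg"  # assumed path to a UAV-captured frame

# One-stage: YOLOv8 detects and classifies vehicles in a single pass.
detector = YOLO("yolov8m.pt")
det_results = detector(image)

# Two-stage: SAM produces class-agnostic segments of the same frame;
# a second step is then required to label each segment as a vehicle.
segmenter = SAM("sam_b.pt")
seg_results = segmenter(image)

print(det_results[0].boxes)   # boxes, classes, confidences (one stage)
print(seg_results[0].masks)   # segmentation masks only: no classes yet
```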
SAM has showcased efficacy in various problem domains related to this study, being able to perform the segmentation of moving vehicles and low-contrast scenes. It is also adapted to real-world scenes and is even able to use different image sensors, such as thermal infrared [44]. With all these characteristics, alongside the easy connection to the YOLO framework, SAM seems suitable for integration into the proposed car detection system. Its main disadvantage is the additional computation time required, not only for the specific segmentation step but for the entire detection and identification process.
In addition, to evaluate the obtained results, a second one-stage detection algorithm is proposed. RT-DETR is a real-time detection algorithm based on a transformer vision architecture. It has shown high accuracy while maintaining real-time capabilities on different object detection problems [24,82].
3.3. Vehicle Tracking
After the detection process, the tracking system is applied to generate trajectories for the different detected ground vehicles. The main goal is to generate and maintain vehicle IDs across multiple detections, facilitating trajectory generation for the decision-making processes to be applied in future works of this line of investigation. In addition, the tracking system can provide feedback to the detection system, allowing a refinement and improvement of future detections.
To enhance the detection algorithms, the tracking system employs the already computed trajectories to perform predictions and determine possible positions of the vehicles. This strategy aims to narrow the detection area, adding a second chance of reprocessing in case of missed detections.
To transform detections into waypoints usable by the tracking process, a georeferencing step must be applied. This algorithm is in charge of computing the geodetic coordinates of pixels in the image, assigning a position to each detected vehicle. In addition, with multiple sources of information, it is important to consider all of them when performing the tracking, so a fusion system is needed to join the information coming from each UAV in the swarm.
Figure 4 presents an overview of the detection and tracking process. The process is applied for each image generated by a UAV. Initially, a pre-trained detection algorithm is applied to the entire image.
If the algorithm fails to detect any vehicle but previous predictions indicate the presence of a vehicle, a reduced area of the image is computed in the zone of the prediction. This reduced area allows the detection algorithms to be applied to a smaller set of pixels, which can recover previously missed detections, as the prediction estimates a vehicle in that area.
In cases where no detection occurs despite previous predictions, the prediction is maintained for future detections, but no vehicle trajectory information is generated, to ensure tracking fidelity. This means that only confirmed detections will be used for the trajectory information to be used in the future.
Once a vehicle is detected, georeferencing is applied to generate an associated position for the detection. Using this position, the trajectory data fusion process will either associate the detection with an existing trajectory or create a new trajectory. The predictions generated by the data fusion system will be utilized in future iterations of the process.
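The following sketch summarizes this per-image flow; the detect, georeference, and fuse callables, as well as the crop size, are illustrative stand-ins for the modules described in this section, not the paper's code.

```python
# Sketch of the per-image detection loop with prediction-based reprocessing.
def crop_around(image, pred, size=200):
    # Hypothetical helper: crop a window centred on the predicted vehicle
    # position projected into pixel coordinates (u, v); numpy-style slicing.
    u, v = int(pred[0]), int(pred[1])
    return image[max(0, v - size):v + size, max(0, u - size):u + size]

def process_frame(image, predictions, detect, georeference, fuse):
    detections = detect(image)

    if not detections and predictions:
        # Second chance: re-run the detector on a reduced area around
        # each predicted position, where a vehicle is expected.
        for pred in predictions:
            detections += detect(crop_around(image, pred))

    if not detections:
        # Keep the predictions alive for future frames, but generate no
        # trajectory information: only confirmed detections are used.
        return

    for det in detections:
        position = georeference(det)
        fuse(position)  # extend an existing trajectory or open a new one
```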
Note that the system can have multiple detection and tracking modules working at the same time, each one processing the images captured by one UAV in the system. The only common element, the trajectory dataset, is shared and updated by each working tracking module. This means that the prediction of the future positions of a trajectory is performed by the trajectory data fusion module, using a more precise position of the global trajectory that combines all UAV measurements.
3.3.1. Georeferencing
To compute a position from the captured photographs, it is necessary to apply a georeferencing step, computing the geodetic coordinates of the pixels from the image and consequently to the detected vehicles. In a real-world problem, this process relies on camera calibration and needs metadata associated with the captured photographs, including the following:
The position of the camera at the time of capturing the image, which can be calculated from the UAV position;
The specific point in space the camera was directed towards, which can be computed from the roll, pitch, and yaw angles of the camera;
The Field of View (FOV) of the camera, in its horizontal ($FOV_h$) and vertical ($FOV_v$) dimensions.
Within the simulation proposed for this study, all this metadata can be easily obtained from the UAV simulator and the 3D engine. It is even possible to configure cameras with specific characteristics for the problem. Assuming a camera situated at (0,0,0) and pointing at a flat surface from above, the pinhole camera model detailed in [1] allows for the transformation between pixel coordinates $(u, v, w)$ and real-world coordinates $(x, y, z)$. With image center $(u_c, v_c)$ and the camera at distance $z$ from the flat surface, the back-projection takes the form

$$x = z \tan\!\left(\frac{FOV_h}{2}\right)\frac{u - u_c}{u_c}, \qquad y = z \tan\!\left(\frac{FOV_v}{2}\right)\frac{v - v_c}{v_c}.$$

This means that it is possible to estimate the real-world position using the camera FOV and the image center coordinates $(u_c, v_c)$. Then, it is possible to apply translations using the camera position and trigonometric transformations using the roll, pitch, and yaw angles of the camera [1].
Note that to solve this problem, the calibration process of the camera is an essential step, although the simulation allows for an easier approach by providing the intrinsic parameters of the camera.
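An illustrative sketch of this step, under the same assumptions as the equations above (downward-looking camera over a flat surface), is given below; the function and parameter names are ours, not the paper's implementation.

```python
# Back-project a pixel to ground-plane offsets under the pinhole model.
import math

def pixel_to_ground(u, v, width, height, fov_h, fov_v, cam_alt):
    """Return ground-plane offsets (x, y) in metres for pixel (u, v).

    fov_h / fov_v are the horizontal and vertical FOV in radians, and
    cam_alt is the camera height above the flat surface.
    """
    u_c, v_c = width / 2.0, height / 2.0  # image center coordinates
    x = cam_alt * math.tan(fov_h / 2.0) * (u - u_c) / u_c
    y = cam_alt * math.tan(fov_v / 2.0) * (v - v_c) / v_c
    return x, y

# The resulting offsets are then rotated by the camera's roll/pitch/yaw and
# translated by the UAV position to obtain geodetic coordinates.
```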
3.3.2. Trajectory Data Fusion
The trajectory data fusion process consists of two steps, as illustrated in Figure 5. The elements highlighted in white apply individually to each UAV, while the elements highlighted in gray are common to all UAVs in the system. Additionally, the system is linked to the UAV control system to request new images from the UAVs, which will be incorporated into the vehicle detection process, thereby initiating the loop once again.
Once a new position coming from the georeferencing is introduced into the vehicle tracking process, the data fusion compares it, for each of the current trajectories in the system, with the predicted positions using the Euclidean distance function to decide whether it belongs to a previously generated trajectory.
When the computed distance is under a defined threshold, the prediction and the detection are considered to belong to the same trajectory. In that case, the new position is added to the previously existing track. Otherwise, as no previous prediction matches the new detection, a new track is generated for future detections.
To associate the positions computed on each UAV into a single trajectory, the proposal highlighted in gray inside Figure 5 involves an implementation based on the Munkres algorithm. With this proposal, it is possible to select the optimal global trajectory to be associated with each new local detection.
The local vehicle detection process applied for each UAV image assigns local IDs for the detected cars, and then the global association process will attempt to assign each local ID to a corresponding global ID, selecting the most suitable option from the list of global trajectories.
To perform the comparison between global trajectories, a prediction is made for each trajectory. Then, the positions detected for the local trajectories of each UAV are compared using the Euclidean distance, assigning the new detection to the nearest global trajectory when the computed distance between the two positions falls within a predefined threshold. Note that it is possible to use not only the computed position but also kinematics such as velocity or acceleration to refine this association process.
The procedure iterates over all UAVs and vehicles until all local IDs are effectively merged into existing trajectories or new trajectories are created for the unassociated local positions. The defined threshold serves as a criterion for determining the minimum proximity required to establish a meaningful association. Through this iterative process, the proposed methodology aims to achieve a comprehensive and accurate alignment of trajectories, enhancing the overall efficacy of the system for vehicle tracking and monitoring.
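A minimal sketch of this gated association, using SciPy's implementation of the Munkres (Hungarian) algorithm and an assumed distance threshold, could look as follows.

```python
# Gated local-to-global association via the Munkres (Hungarian) algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(local_positions, global_predictions, threshold=5.0):
    """Match local detections (m x 2) to predicted global positions (n x 2)."""
    # Euclidean distance matrix: locals x globals.
    cost = np.linalg.norm(
        local_positions[:, None, :] - global_predictions[None, :, :], axis=-1
    )

    rows, cols = linear_sum_assignment(cost)

    matches, unmatched = [], []
    for r, c in zip(rows, cols):
        if cost[r, c] <= threshold:
            matches.append((r, c))   # local r extends global trajectory c
        else:
            unmatched.append(r)      # too far: will start a new trajectory
    unmatched += [r for r in range(len(local_positions)) if r not in rows]
    return matches, unmatched
```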
Once the association is applied, the position of each global trajectory waypoint is defined as the centroid of all the local positions assigned to the corresponding trajectory.
With the global trajectory, it is possible to perform a prediction for the vehicle position in the future by using the previous waypoints. To do this, it is necessary to take the position at two different timestamps and compute the vehicle velocity between the two points. Having the velocity and the last position makes it possible to compute an estimated future position.
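In our notation, this constant-velocity prediction can be written as

$$\mathbf{v} = \frac{\mathbf{p}(t_2) - \mathbf{p}(t_1)}{t_2 - t_1}, \qquad \hat{\mathbf{p}}(t_2 + \Delta t) = \mathbf{p}(t_2) + \mathbf{v}\,\Delta t,$$

where $\mathbf{p}(t_1)$ and $\mathbf{p}(t_2)$ are two timestamped waypoints of the global trajectory.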
Thanks to the use of the data fusion system, it is possible to optimize future detections by taking into account the relative accuracy of the inputs from swarm members. By integrating data from all the UAVs, data fusion reduces uncertainties and compensates for the limitations of individual perspectives within the swarm, such as occlusions or blind spots. The system aligns and synchronizes data from each source, filtering out inconsistencies and enhancing vehicle detection. This leads to improved detection rates, fewer false positives, and better handling of complex environments, ultimately resulting in a more robust and effective surveillance solution.
An important factor to take into account is that the scalability of the data fusion system can be limited when adding large quantities of UAVs, and the association problem can become complex when a large number of trajectories is maintained, especially in scenarios with high-density traffic. To mitigate this problem, high-end hardware capable of multithreading the data fusion process into multiple parallel processes can be used. Alternatively, cloud computing is also an option, although it is then necessary to ensure a high-quality internet connection in the deployment scenario to guarantee real-time processing.
3.4. Real-World Challenges
When deploying the swarm in a real-world environment, several challenges can appear, affecting the swarm behavior at the technical, environmental, and operational levels [55,83]. It is important to take into account that the current system implements only a SITL simulation, which provides valuable insights into the problem. However, future HITL and VITL implementations will have to consider the real-world problems that can appear in more detailed systems.
First, when dealing with a real-world implementation, it is essential to take into account the constraints that can appear in the communication and coordination of the swarm members. The proposed system is a centralized swarm that depends on a centralized control station, requiring a robust connection to perform real-time data sharing and UAV coordination. In the real world, UAVs may face limited bandwidth or intermittent connectivity, introducing latency or packet loss that will impact the swarm performance. In addition, the terrain may include physical obstructions that can produce signal interference.
To solve these problems, a real-world implementation must ensure a highly reliable connection. It may even be necessary to consider the implementation of a decentralized swarm to increase the scalability of the communication protocols and the adaptability to different communication conditions.
The terrain challenges not only the communication protocols but also the navigation of the UAVs, making it necessary to consider obstacle avoidance processes that ensure a safe flight and localization protocols that do not rely solely on GPS, as in certain terrains the signal can be weak or inconsistent. In addition, harsh weather conditions (strong winds, rain, fog, low lighting) can affect the swarm’s flight capabilities and even prevent mission execution, resulting in the need for robust sensory and control mechanisms that ensure safe mission development.
To mitigate the communication constraints, a viable option is to include redundancies that ensure message delivery, although it is important to note that this approach can saturate the communication network, adding more constraints to the problem’s scalability. In SITL simulation, all communications are reliable and correctly applied; it is possible to include programs that simulate these constraints, although it is better to test them in HITL, where the real hardware is used.
Other factors to consider, not only for swarm flights but also for single UAVs, are the limited flight time due to battery capacity and the security and privacy that must be ensured by any UAV mission, especially a video-based one. The battery limitation can be simulated in SITL by adding time limits to the mission flight.
Finally, the possibility of failures such as mechanical issues implies the need for fail-safe mechanisms that ensure the execution of the mission. Including redundancy and the ability to reassign tasks and redistribute the area coverage can ensure resilience within the swarm. This feature is also easily implementable in SITL simulation, making it possible to disconnect processes and systems during mission development to evaluate how the system reacts.
Addressing these real-world challenges is crucial for the reliable and effective deployment of UAV swarms in video surveillance. Continued research and technological advancement in communication protocols, navigation systems, energy solutions, data processing techniques, and privacy-preserving methods will be essential in overcoming these obstacles and enabling robust UAV swarm surveillance in complex, real-world environments.
4. Test and Results
The proposed implementation affords flexibility for mission configuration and the testing of various proposals within the system. It enables the modification of multiple parameters within the simulation to generate diverse test scenarios and evaluate the system under different conditions. This encompasses configurations both within the proposed detection system and the simulation environment:
Detection Algorithm: The system allows for the application of different algorithms for vehicle detection and tracking. It facilitates the acquisition of data for both online and offline approaches to system operation;
UAV Configuration: The detection system can be adapted to accommodate different numbers of UAVs. This includes not only adjusting the quantity of UAVs but also selecting appropriate models and configuring onboard sensors;
Mission Configuration: The system allows for the customization of missions assigned to each UAV. This includes defining surveillance positions for each UAV within the simulation. Furthermore, potential future configurations may involve specifying different actions for individual UAVs, such as repositioning and target-following processes;
Scenario Configuration: Beyond the confines of the proposed system, the simulation framework enables the configuration of scenarios by modifying the simulation environment and agents. This encompasses the Unreal Engine map used, including the desired road map, and defining the vehicles to be detected by the UAVs.
From these configurations, the evaluation presented in this study considers two main proposals. First, it is important to select a detection algorithm that can perform the surveillance tasks as desired. Second, it is important to evaluate the selected algorithm working within the swarm system.
For the first evaluation, several configurations of the chosen YOLO algorithm have been implemented and applied through the detection system. These configurations have been tested in different scenarios that represent possible real-world situations.
For the second evaluation, the best configuration of the YOLO algorithm has been selected and inserted into specific situations in which the surveillance output is evaluated with 1 UAV, a swarm of 2 UAVs, and a swarm of 5 UAVs.
For the proposed vehicle surveillance problem, two primary metrics should be prioritized to facilitate the evaluation of the proposed solution and the selection of the most suitable detection algorithm for the diverse simulated scenarios. First, it is necessary to consider the accuracy of the vehicle detection and tracking system. Second, it is necessary to consider the computation time of the applied technique.
In the case of the accuracy metric, it is important to assess the precision of each vehicle detection to evaluate the entire surveillance problem. The goal of this measure is to evaluate the different detection rates among algorithms across all scenarios, providing insights into their efficacy and reliability. The proposal for this study is to use a metric based on the classical MOTA approach. As the target of this study is to build a surveillance system composed of multiple UAVs, the proposed metric is based on the successful detections made by each UAV and the number of UAVs in the swarm.
The average tracking success (ATS) considers the number of successful detections against the total number of detections, where the total includes both successful and unsuccessful detections.
This metric has been implemented to consider both false positives and false negatives, allowing for an assessment of the system’s precision in identifying true vehicle locations and its effectiveness in minimizing errors. A detection is only considered “successful” when the system’s output aligns with the ground truth: either by correctly identifying a vehicle near its georeferenced position or by correctly identifying the absence of a vehicle in an area. A higher ratio indicates stronger alignment with the ground truth data, reflecting greater tracking accuracy and robustness in diverse scenarios.
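Expressed as a ratio, in our notation and following the definition above, this is

$$ATS = \frac{N_{\text{successful}}}{N_{\text{successful}} + N_{\text{unsuccessful}}}.$$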
For a detection to be considered successful, the ground truth of the simulation must be compared with the surveillance output. The following cases appear (Table 1):
The computation time of the deployed algorithm is a significant metric for the proposed problem, as real-time scenarios are one of the main applications of the system; it is especially relevant where the expeditious detection of potential threats or hazards is paramount. The ability to promptly detect and respond to such incidents hinges on the efficiency of the detection algorithm in processing incoming data and generating reactions without delay.
To compute this metric, a timestamp is registered when a new image enters the system, and a new one is registered when the surveillance system assigns a position to the detected vehicle. Averaging all timestamp differences makes it possible to compute an average processing time (APT) in seconds.
To relate both metrics, it is possible to consider the ATS obtained per unit of processing time by computing the ATS/APT ratio.
Although not critical in a SITL scenario, to ensure the comparability of the different experiments, all scenarios are executed on the same hardware under the same conditions. The hardware used is a high-end computer that includes an AMD Ryzen 9 3900X processor, an RTX 2060 6 GB GDDR6 graphics card, and 64 GB of DDR4 RAM. With this hardware, it is possible to multithread the software so that the subprocesses of the different swarm members are computed at the same time.
4.1. Detection Algorithm Evaluation
The proposed approach for system testing involves evaluating different proposals for the detection algorithms and comparing their respective outcomes within a defined set of scenarios. These proposals will encompass diverse detection models for the algorithms and various configurations for the overall system. This evaluation process aims to discern the efficacy and performance of the detection and tracking system under different conditions. The following scenarios have been proposed:
Scenario 1: Using a single UAV to perform the detection of two ground vehicles moving in opposing directions. With this scenario, it is possible to test the performance of the system with a single UAV. In addition, it is possible to test classical computer vision problems like the occlusions that will happen when the two ground vehicles cross paths;
Scenario 2: Using three UAVs to perform the detection of two ground vehicles moving in opposing directions. This is an expansion of Scenario 1, allowing for a comparison of the performance of the system when more UAVs are used in the detection (see Figure 6 for examples of Scenarios 1 and 2);
Scenario 3: Using three UAVs to perform the detection of a single ground vehicle. This is a basic scenario to test the system in an ideal scenario without any complications for the detection;
Scenario 4: Using three UAVs to perform the detection of a single ground vehicle performing complex maneuvers like the one presented in Figure 7. This includes a roundabout with multiple image angles for the detected vehicle, as well as crossing and overlapping problems. This is an expansion of Scenario 3, but in this case, complex maneuvers are introduced in the path of the ground vehicle, allowing the detection of those complex maneuvers to be tested;
Scenario 5: Using three UAVs to perform the detection of three ground vehicles. This allows experimentation on the tracking system and makes it possible to study how the trajectories are generated from multiple sources and how the system can handle multiple vehicles, including specifically generated problems like occlusions in computer vision;
Scenario 6: Using three UAVs inside a realistic simulation scenario with multiple vehicles. This allows for the full SITL scenario, introducing multiple vehicles at the same time and evaluating whether the tracking system can compute all trajectories without problems (see Figure 8).
All the proposed scenarios are developed to simulate real-world conditions like changes in perspectives, distance to targets, occlusions, and ground vehicle maneuverability. With these representations, the objective is to discern the efficacy and performance of the detection and tracking system by configuring swarms working with different algorithm implementations.
The initial evaluation focuses on the use of one-stage versus two-stage detection models. Specifically, YOLO (You Only Look Once) and SAM (Segment Anything Model) have been compared. Both models in their best-tested configurations exhibit strong performance in the detection stage in most experiments, with SAM demonstrating notable proficiency in detecting vehicles regardless of their distance.
However, the drawback of SAM is the substantially increased computation time compared to the YOLO model. This comparison is illustrated in the table below, which shows both models’ performance under identical detection conditions. The scenarios considered include capturing a full image by the UAV and a region-of-interest scenario where the detection area is reduced based on the prediction algorithm.
As can be seen in Table 2, the generation of masks by SAM takes a considerable amount of time, making it impossible to perform real-time detection. This situation is aggravated further when taking into account that the vehicles segmented by SAM are not classified, so a second stage is needed to identify them.
While SAM demonstrated significant potential for the given problem, the performance penalty of using this algorithm is a fundamental consideration for the implementation of the proposed system in a real-world application. For the moment, this makes SAM unsuitable for real-time problems that require fast processing of the UAV-captured images.
In general, the performance penalty of two-stage detection models poses a critical challenge for real-world applications that require fast processing of UAV-captured images, making them less desirable for real-time problems.
This does not mean that these algorithms are not useful; it means only that they should be reserved for offline scenarios in which the processing time is not relevant. On the other hand, the expeditious nature of YOLO and its good detection accuracy translate to reduced detection time and enhanced real-time tracking capabilities, positioning one-stage algorithms with similar characteristics as a more pragmatic tool for decision-making and road surveillance in UAV applications.
The following table presents an in-depth examination of the different YOLO architectures tested for this study. The table showcases results for the following YOLOv8 architectures trained for detection: nano, small, medium, large, and extra-large [84].
In addition, to consider the possibility of a segmentation model that does not involve a high computation time, the option of using YOLOv8 medium alongside a lightweight segmentation step [85] is included. With this model, the target is to evaluate whether the segmentation adds a significant improvement to the YOLO model, making it useful for the proposed solution.
The results displayed in Table 3 highlight two primary metrics: average processing time and average tracking success. Average processing time denotes the time required, on average, for the algorithm to execute the designated task for each image captured by a UAV during the experiment. This metric serves to evaluate the algorithm’s efficiency in producing results within the system and allows for the selection of the best implementation for a real-time tracking system. Conversely, average tracking success assesses the algorithm’s performance when integrated into the tracking system, reflecting the accuracy of the vehicle detection and tracking functionalities of the system. For both metrics, the best results are highlighted in bold, while the worst results are indicated in gray.
Analysis of the computed metrics reveals the expected trend: larger architectures incur longer average processing times. In all scenarios, YOLO nano presents the shortest time, while YOLO extra-large presents the longest. The only exception to this trend is in scenario 1, where image segmentation imposes additional computation time on the YOLO medium model trained for segmentation. Furthermore, the use of architectures trained for segmentation does not yield satisfactory results for this problem domain. In contrast, the YOLOv8 medium architecture trained for detection demonstrates superior performance in terms of average tracking success, showing more effectiveness for vehicle detection tasks.
Regarding the tracking success, the analysis of the scenario 1 results indicates that larger architectures yield superior performance, with the extra-large architecture achieving a tracking success rate five times higher than the other architectures. However, this trend does not persist when additional UAVs are introduced into the system. Subsequent experiments that include more UAVs reveal that other architectures can achieve results equivalent to YOLO extra-large; moreover, YOLO large outperforms the extra-large architecture in most cases.
Also, when considering the relation between both metrics through the ATS/APT ratio, the YOLO small architecture outperforms the other architectures in most scenarios, with YOLO medium as the next best candidate. This means that the improvement in the ATS metric offered by larger architectures is not as large as the improvement in processing time offered by the smaller ones.
Consequently, the direct inference drawn from these findings is that augmenting the number of surveillance UAVs holds greater significance than solely relying on the best tracking algorithm: the data fusion of multiple sensors in the system provides more available information and, consequently, better results in terms of tracking success. Nevertheless, it is important to consider that in scenarios with a single UAV, optimal results are achieved with larger architectures.
Taking into account the ATS, in scenarios involving multiple UAVs, YOLO v8 large and medium architectures emerge as the most favorable options. Generally, YOLO large is preferred for its accuracy, whereas YOLO medium excels in terms of computation time. Notably, while YOLO medium yields comparable results in simpler scenarios with a single vehicle (scenarios 1, 3, and 4), YOLO large architecture proves significantly more effective in complex scenarios featuring multiple vehicles.
However, when the APT is introduced into consideration, YOLO medium outperforms YOLO large, and YOLO small has the best overall results. The conclusion depends on the final use of the surveillance system: if reduced processing time is essential, YOLO small provides the best results, but with a minor increase in processing time, the bigger architectures can provide better results.
When looking specifically at the segmentation results (see Figure 9), it is possible to see that, in terms of processing time, the implementation without segmentation is better than the implementation with segmentation. In terms of tracking success, the segmentation implementation has less variability in its results, but, in general, omitting the segmentation produces better results on this problem. This effect appears due to errors arising in the segmentation step (lighting, occlusions, or background clutter) that propagate to the detection algorithm, especially in scenarios with multiple vehicles of varying sizes and orientations that are not well separated by the algorithm.
These results show how a lightweight segmentation model, though it can be beneficial on some occasions, can produce worse results on others, especially in scenarios with multiple vehicles and background objects. SAM and other advanced segmentation models can produce better results but, as already discussed, their high computational demands make them challenging for real-time application.
Refining advanced models with pruning techniques can produce similar results in less time. Another option is to build an ensemble method into the fusion system, combining SAM with a fast detection algorithm. In this approach, the fast detection algorithm is used by default, but a decision rule triggers SAM when the detection algorithm fails. For example, the bounding box confidence can be evaluated, or SAM can be applied only when a tracked vehicle is lost.
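A sketch of such a decision rule is given below; the wrapper functions and the confidence threshold are illustrative assumptions, not the paper's implementation.

```python
# Ensemble decision rule: fast one-stage detector by default, SAM fallback.
CONF_THRESHOLD = 0.5  # assumed bounding box confidence gate

def detect_with_fallback(image, yolo_detect, sam_detect, tracked_lost=False):
    boxes = yolo_detect(image)  # fast pass, list of (box, confidence) pairs

    low_confidence = not boxes or max(c for _, c in boxes) < CONF_THRESHOLD
    if low_confidence or tracked_lost:
        # Pay the segmentation cost only when the cheap detector is
        # unreliable or a tracked vehicle has been lost.
        return sam_detect(image)
    return boxes
```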
Also, by including different UAVs in the system, it is possible to differentiate the tracking tasks: UAVs with the slower approach that uses the advanced segmentation model sacrifice real-time capabilities, while UAVs dedicated to fast detection maintain real-time surveillance.
4.2. Swarm-Based Surveillance System Evaluation
To evaluate the use of swarms in the surveillance problem, representative real-world scenarios are assessed with varying numbers of UAVs, considering how an increase in UAVs can enhance surveillance effectiveness.
The first scenario involves the detection of a single vehicle, serving as a baseline example of how the inclusion of additional UAVs can improve detection probabilities and overall surveillance output. The second scenario follows the same vehicle into a roundabout, where the detection perspectives of the UAVs change significantly as the vehicle executes a turning maneuver.
Subsequently, additional ground vehicles are introduced to create an occlusion scenario, in which the target vehicle is completely obscured by another vehicle. This scenario resembles the first two but differs in that the vehicles move in the same direction at different speeds, thereby prolonging the occlusion duration.
This occlusion restricts the detection capabilities of the UAV’s camera sensor, which can be mitigated through the use of additional information sources. Furthermore, the difficulty of this scenario can be increased by subsequently adding more vehicles within the UAV’s detection area, including different vehicle models to represent a real-world detection setting. This allows the surveillance of several targets to be tested simultaneously, encompassing situations involving vehicles moving in both the same and different directions, as well as multiple occlusions throughout the simulation.
These traffic scenarios, depicted in Figure 10, reproduce the same situation with an increasing number of vehicles, allowing the evaluation of how multiple ground vehicles can degrade the performance of the surveillance solution.
Finally, to simulate an additional real-world condition, experiments are conducted in low-light environments to evaluate the applicability of the proposed solution in nighttime scenarios.
For the results presented in Table 4, the first conclusion is that, as expected, the swarm outperforms the single UAV in all scenarios, and increasing the number of swarm members improves the detection, although the results for three and five UAVs are similar. Note that the reported processing time is averaged across all swarm members; the total time would therefore be larger.
The occlusion situation provides an interesting result: the single UAV performs poorly, as the ground vehicle is completely lost from its perspective, while the swarms operate better by having multiple perspectives.
Table 5 presents the results of a single member of the swarm to illustrate this situation. As can be seen, these results are considerably better than those of the single-UAV mission; this is because the single UAV was deliberately placed in an unfavorable position, compared to this swarm member, in order to force the occlusion scenario.
The same applies to the swarm of five UAVs being outperformed by the swarm of three UAVs: adding UAVs with poor perspectives yields worse results than a smaller swarm whose members are located in the best positions.
This means that the number of UAVs is not the only element to consider; the placement of the UAVs within the mission is an equally relevant factor when configuring the mission.
In the complex scenarios, another relevant conclusion is that the low-lighting conditions have little effect on the surveillance output, producing, as expected, worse results than the base scenario but with an overall good performance. However, including more ground vehicles in the simulation has a more noticeable effect. This problem appears due to additional false-positive detections that arise when a vehicle is detected by different UAVs but is not associated with a single global trajectory.
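To illustrate where these false positives originate, the following hedged sketch associates ground-projected detections from one UAV with the global trajectories using the Hungarian algorithm and a gating distance; detections left unmatched are the candidates that may spawn spurious global tracks. This is not the exact fusion logic of the system, and the gating distance is an assumed value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 5.0  # assumed gating distance in meters; detections beyond this
            # radius stay unmatched and may spawn spurious global tracks

def associate(global_tracks, detections):
    """Match ground-plane detections from one UAV to global trajectories.

    global_tracks, detections: numpy arrays of shape (N, 2) and (M, 2) with
    (x, y) ground coordinates. Returns matched (track, detection) index
    pairs and the indices of unmatched detections, i.e., the source of
    duplicate or false-positive trajectories.
    """
    if len(global_tracks) == 0 or len(detections) == 0:
        return [], list(range(len(detections)))

    # Pairwise Euclidean distances between tracks and detections.
    cost = np.linalg.norm(
        global_tracks[:, None, :] - detections[None, :, :], axis=2
    )
    rows, cols = linear_sum_assignment(cost)

    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= GATE]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(detections)) if c not in matched_dets]
    return matches, unmatched
```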
To illustrate these results, Figure 11 shows a heatmap of the ATS metric across all scenarios for the tested swarms of one, three, and five UAVs. As can be seen, the proposed swarm system maintains its ATS performance under all the conditions proposed in the scenarios, making it possible to appreciate the benefit of using more UAVs to perform surveillance tracking.
To compare against the YOLO algorithm, the RT-DETR algorithm is proposed as an alternative one-stage model capable of obtaining high-accuracy results in real-time processing. Table 6 shows the same swarm experiments as with the YOLO algorithm, but with the implementation modified to use the RT-DETR algorithm.
With reference to the RT-DETR results, the conclusions are similar: the inclusion of additional UAVs improves the surveillance output, but the UAV perspective remains an important factor. The main advantage of using swarms is still the simultaneous inclusion of multiple perspectives.
To compare the results of both algorithms, Figure 12 shows the ATS and APT values for each experiment. As can be observed, YOLO has slightly better accuracy and a much lower computation time, obtaining a much better ATS/APT result.
In addition, as expected, single-vehicle detection yields the best overall results, and adding real-world conditions to the problem produces more detection errors, reducing the ATS value; the most problematic element is the occlusion produced by terrain elements or multiple cars in the scenario. This means that, for real-world problems, it is essential to configure the mission so that it provides different perspectives in order to obtain useful results. The main advantage of the swarm is its capability to include multiple perspectives in the solution.
To evaluate the swarm effect, Figure 13 summarizes the metric results for one, three, and five UAVs. As can be seen for both algorithms, the average processing time remains fairly constant when adding more UAVs. The average tracking success improves as UAVs are added, although the results tend to plateau: there is more improvement between one and three UAVs than between three and five. This means that including more UAVs is not always the best solution, as poor perspectives can be introduced into the solution and degrade the average success value.
The swarm still provides better results than a single-UAV mission, but the configuration of the mission is more important than the inclusion of a large number of UAVs. To improve this solution, an advanced fusion system can be included to select among UAV detections, discarding local detections that provide wrong information.
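One simple form such a selection could take, sketched under the assumption that local detections are already projected into common ground coordinates, is a consensus filter that keeps only detections corroborated by a minimum number of UAVs; the radius and observer count below are assumed values.

```python
import numpy as np

CONSENSUS_RADIUS = 5.0  # assumed agreement radius in meters
MIN_OBSERVERS = 2       # a detection must be seen by at least two UAVs

def consensus_filter(detections_per_uav):
    """Discard local detections not corroborated by another UAV.

    detections_per_uav: list (one entry per UAV) of (M_i, 2) numpy arrays of
    ground-plane (x, y) detections. Returns the detections that at least
    MIN_OBSERVERS UAVs agree on; merging the surviving duplicates into a
    single global track would be the next step of the fusion system.
    """
    kept = []
    for i, dets in enumerate(detections_per_uav):
        others = [d for j, d in enumerate(detections_per_uav) if j != i and len(d)]
        for det in dets:
            # Count UAVs (including the observer) reporting a nearby detection.
            observers = 1 + sum(
                np.min(np.linalg.norm(o - det, axis=1)) <= CONSENSUS_RADIUS
                for o in others
            )
            if observers >= MIN_OBSERVERS:
                kept.append(det)
    return kept
```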
Finally, to compare all results, Table 7 shows the performance of the fast detection algorithms on a single-vehicle detection case against the fastest detection achieved with SAM, using the reduced-area image. As can be seen, SAM is more accurate, but with the use of swarms the faster algorithms achieve comparable results in a time much more suitable for the real-time surveillance problem.
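The reduced-area strategy mentioned above can be sketched as cropping a region of interest around the last known bounding box before invoking the segmentation model, so that SAM processes only a fraction of the frame; the margin value below is an assumption.

```python
def crop_roi(frame, last_box, margin=0.5):
    """Crop a region of interest around the last known bounding box so that
    the expensive segmentation model only processes a fraction of the image.

    frame: H x W x 3 image array; last_box: (x1, y1, x2, y2) in pixels;
    margin: fraction of the box size added on each side (assumed value).
    Returns the cropped patch and its offset, needed to map the
    segmentation results back to full-frame coordinates.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = last_box
    mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - mx), max(0, y1 - my)
    x2, y2 = min(w, x2 + mx), min(h, y2 + my)
    return frame[y1:y2, x1:x2], (x1, y1)
```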