1. Introduction
In an era characterized by unprecedented technological advancements, the integration of Unmanned Aerial Vehicles (UAVs) has emerged as a transformative force across various industries. One particularly promising application lies in the domain of road monitoring and surveillance, where UAVs can be used for vehicle detection and tracking, redefining the capabilities and efficiency of traditional monitoring systems.
Traditional road monitoring systems, although effective to a certain extent, are often limited by factors such as coverage, cost, and real-time responsiveness. UAV swarms, with their ability to operate autonomously and collaboratively, offer a paradigm shift in the way we perceive and implement road surveillance. The scalability, flexibility, and adaptability of these swarms present a novel solution to address the dynamic and evolving nature of modern road networks.
The convergence of cutting-edge technologies, such as artificial intelligence, sensor miniaturization, and advanced communication systems, has paved the way for the development of UAV swarms capable of collaborative and intelligent operations. However, deploying UAVs in real-world scenarios involves significant risks and challenges, primarily related to safety and security. Therefore, robust and reliable systems must be developed and rigorously tested before real-world implementation.
This paper explores the potential of harnessing UAV swarms for road monitoring and surveillance, using the UAVs to perform vehicle detection and tracking tasks through computer vision technologies and advanced simulation frameworks. This proposal aims to develop a real-time surveillance system capable of detecting ground vehicles within a designated area. A UAV swarm is an entity formed by more than one UAV that can perform coordinated tasks to solve a problem [1]; the coordination implies the execution of a shared mission, either performing common or distinct tasks. Swarm approaches like the implementations of [2,3] have shown extensive operational capacity. In this study, the use of swarms offers the advantage of extended area coverage and provides multiple perspectives for data acquisition. In addition, the swarm allows for the use of collaborative search and tracking techniques [4,5,6]. All of this implies that the additional data collected by the swarm can enhance the detection capabilities compared to a single UAV.
Two fundamental elements are key to implementing the proposed surveillance system: first, target detection, where sensor data are used to identify and locate objects of interest; second, tracking the detected targets over time. Improving detection accuracy directly enhances tracking performance, making advanced detection algorithms ideal for such systems. However, advanced detectors typically require additional computational resources, increasing the time needed for tracking.
For the detection, this study relies on the use of computer vision technology [7] through cameras mounted on Unmanned Aerial Vehicles (UAVs). The study of UAV-based surveillance systems in the literature shows two main approaches when applying computer vision models [1,8]. First, one-stage models perform detection and classification in a single step, offering the advantage of reduced processing time, which is critical for real-time applications. Second, two-stage models separate the detection and identification tasks, potentially improving accuracy but at the cost of increased computation time.
The speed provided by one-stage models is ideal for the desired tracking task, but their lower detection accuracy can produce underperforming tracking within the surveillance system. One way to counteract this effect is to use distributed surveillance systems that, through data fusion, allow less accurate detections to still achieve a good tracking output.
The primary goal of the proposed swarm-based system is to distribute the surveillance tasks between the different UAVs of the swarm and to combine faster detection algorithms with the fusion system, producing a tracking output whose quality is comparable to that of a slower detector on a single UAV.
To accomplish this goal, it is necessary to evaluate two main factors. On the one hand, the detection algorithms should provide enough quality when performing the real-time tracking problem. On the other hand, the swarm configuration and the integration of multiple UAVs to monitor the desired area should be evaluated to determine how the addition of new UAVs can improve the tracking performance.
For the swarm configuration, the proposal presented in this paper is to use a centralized control station to allow UAV coordination and common processing of the captured data [1]. This centralized approach facilitates the use of data fusion techniques to integrate information from multiple UAVs, thereby enhancing the overall surveillance output. These swarm capabilities will be evaluated with a designed set of scenarios that assess how adding new UAVs to the system generates additional sources of information.
To evaluate the performance of the system, it is necessary to implement a one-stage algorithm that can operate on a real-time problem. The You Only Look Once (YOLO) model is one of the main representatives of one-stage models, widely used in the literature for its real-time detection capabilities and its balance between speed and accuracy across different computer vision problems. For all these reasons, it has been selected to test and evaluate the swarm configuration and the improvements produced by including more sensors in the system.
As an initial experiment, this study aims to evaluate the challenges a high-precision, two-stage model faces when applied in a real-time context. To test this approach, we utilize the Segment Anything Model (SAM), a recent advancement in image segmentation known for its strong generalization across various applications. However, as a segmentation-only model, SAM requires an additional stage to perform vehicle identification, which is essential for tracking. By comparing the performance of both the YOLO model and SAM under identical conditions, it is possible to determine whether a high-precision model is a suitable detection algorithm for the proposed surveillance task.
The results show that, as expected, YOLO outperforms SAM in speed, making it the better approach for real-time applications. Building on the results of the detection algorithm evaluation, an in-depth analysis has been performed on the YOLO architecture to evaluate the different configurations of the model inside the swarm system.
To evaluate the proposed system, the proposal is to consider both the computation time, as a measure of the real-time needs of the problem, and an accuracy metric based on the Multi Object Tracking Accuracy (MOTA) metric [9], capable of evaluating the tracking problem. The combination of both metrics can provide insights into the trade-offs between real-time performance and detection precision.
All testing of the system was conducted within a highly realistic virtual environment using a Software-In-The-Loop (SITL) testing strategy. This approach ensures safety while allowing for extensive experimentation, eliminating the risks associated with real-world testing. The primary objective of this study is to validate the correct application of algorithms and problem-solving approaches before real-world deployment. Although the simulation and experimental setup aim to closely replicate real-world conditions, they are specifically designed to provide a safe and controlled testing environment.
In summary, the key proposals for the current study are as follows:
Using UAV Swarms for Enhanced Road Surveillance: UAV swarms are proposed as an alternative to traditional road monitoring systems, offering improved coverage, flexibility, and responsiveness. These swarms are expected to provide a more scalable and adaptable solution for monitoring dynamic road networks.
Improved One-Stage Real-Time Detection and Tracking Using Computer Vision: The system uses UAVs with mounted cameras to detect and track vehicles in real time. With the use of the swarm, one-stage models are enhanced to obtain high accuracy with short processing times.
Centralized Control and Data Fusion for Swarm Coordination: The UAV swarm system employs a centralized control station to coordinate the UAVs and consolidate data. This centralized approach facilitates the fusion of data from multiple UAVs, enhancing tracking accuracy through distributed sensing and collaborative analysis.
To test these proposals, the paper presents an evaluation using a virtual environment with Software-In-The-Loop (SITL) simulation to safely evaluate the proposed system. A set of scenarios replicating real-world situations is then processed and analyzed, comparing different YOLO configurations and the RT-DETR algorithm to evaluate their real-time computation performance. The system’s performance is assessed using a combination of computation time (for real-time capability) and Multi Object Tracking Accuracy (MOTA) to gauge the balance between real-time responsiveness and tracking accuracy.
This study is organized as follows: Section 2 presents a study of the literature related to road surveillance, evaluating the most used algorithms and delving into the problem from the UAV perspective, looking also into swarm capabilities. Section 3 presents the proposed system for the UAV-based tracking and surveillance problem, detailing the needed simulation components. Finally, Section 4 details the experimentation with the proposed system, while Section 5 provides the study conclusions.
3. System
The proposed UAV-based surveillance system consists of two main subsystems. The first subsystem is the swarm configuration and UAV control, which manages the deployment of UAVs. The second subsystem comprises the computer vision algorithms used for detecting and tracking ground vehicles. The aim is to integrate both subsystems to create a more robust and efficient solution for road surveillance.
The swarm approach is designed to leverage the additional sensors provided by the inclusion of multiple UAVs. These additional sensors enable the capture of more comprehensive data, which can be used to enhance the tracking of ground vehicles. Moreover, a swarm configuration offers several advantages over a single UAV, including extended area coverage, increased redundancy, and the reduction of blind spots caused by obstacles within the mission environment.
In addition, for the swarm control, it is important to configure the simulation to replicate the control performed in a real-world scenario. The use of UAVs in the real world is highly restricted due to the inherent risks associated with the potential misuse of this technology. Consequently, the safety and security of the operations that involve UAVs must be guaranteed before the implementation of a real-world system. To achieve this, the proposal is to configure a realistic simulator on which the UAVs will behave as expected in the real world.
The car detection and tracking proposal implies the use of cameras mounted on each UAV and the application of a computer vision approach to perform ground vehicle detection. The implemented solution allows for the testing of different computer vision algorithms and their corresponding configurations. These algorithms must be adapted for the detection of ground vehicles from aerial images captured by UAVs. Then, the tracking identifies and follows the vehicles over time, maintaining a consistent vehicle ID across multiple detections and generating trajectories that can be utilized in the future decision-making processes of the proposed surveillance system.
Apart from the two main components, it is important to remark on the need for a trajectory data fusion system to unify all information provided by the different UAVs into a single tracking output.
3.1. Swarm System Overview
The proposal for this paper is to use a virtual simulation that replicates the complete behavior of the real-world system. By employing such simulations, it becomes feasible to develop and thoroughly test the system’s functionalities and operational efficacy in a controlled environment prior to deployment in actual operational settings.
Figure 1 presents an example of the images captured by different swarm members.
The system proposed in this study is implemented over a framework that has been used in previous works [1]. The proposal is to perform a SITL simulation, which includes a simulation environment configurable with multiple ground vehicles that can be used as targets for the detection and tracking functionalities of the proposed system. The simulation is achieved using the AirSim framework applied over Unreal Engine to achieve photorealism and realistic UAV physics. This framework has been chosen for the high-fidelity images that can be produced with the Unreal Engine software, making it one of the best options for UAV simulators, specifically for systems designed around computer vision approaches.
The AirSim framework is also prepared for multiple UAV simulations, thus allowing for the swarm configuration. To implement each UAV’s control, the proposal is to use the PX4 flight control software; this software is used inside real-world flight controllers that can be mounted onto UAVs. Consequently, the control logic governing the simulated UAVs mirrors that of their real-world counterparts, thereby facilitating future extensions of the proposed system. In addition, the PX4 flight control software allows for missions with a wide spectrum of configurations, supporting multiple UAVs, different sensors, and even the option of HITL simulations in future developments. With this software, it is possible to execute complex flight missions on the UAV, including communication with the control station.
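As an illustration, the sketch below generates an AirSim settings.json declaring a two-UAV PX4 SITL setup of this kind. The vehicle names, ports, and spawn offsets are assumptions for this example, not the exact configuration used in the study.

```python
# Minimal sketch: generating an AirSim settings.json for a two-UAV PX4 SITL
# swarm. Each PX4 SITL instance must listen on its own TCP port.
import json
from pathlib import Path

settings = {
    "SettingsVersion": 1.2,
    "SimMode": "Multirotor",
    "Vehicles": {
        f"UAV{i}": {
            "VehicleType": "PX4Multirotor",
            "UseSerial": False,
            "UseTcp": True,
            "TcpPort": 4560 + i,          # one PX4 SITL instance per port
            "X": 0, "Y": 4 * i, "Z": 0,   # assumed spawn offsets in metres
        }
        for i in range(2)
    },
}

# AirSim reads this file from ~/Documents/AirSim/settings.json by default.
Path("settings.json").write_text(json.dumps(settings, indent=2))
```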
A UAV swarm mission can be configured as either a centralized or decentralized system [1]. In the former, all communications occur between the UAVs and a control station, and all decisions are made by the control station; in the latter, the UAVs can communicate information to other UAVs within the system and can operate autonomously, coordinating themselves to make decisions without the intervention of the control station. In this context, a centralized control station offers advantages in terms of precise positioning, coordination of the swarm, and the application of data fusion techniques to generate a unified tracking output for all UAVs in the system. Conversely, decentralized missions provide autonomy to mission agents, automatic configuration, and response to UAV failures, enabling autonomous repositioning and replanning.
The centralized swarm approach has the advantages of more efficient mission control, as the data are processed at a single point; a simplified UAV mission design, as all systems behave equally; and reduced communication requirements, as there is no data exchange between UAVs. On the other hand, the centralized control station introduces a single point of failure that could endanger the entire mission if the control station fails, reduced autonomy of the UAVs, and restricted scalability as the load of controlling a large number of UAVs increases.
Decentralized swarms also imply coordination difficulties, as synchronizing the actions of all swarm members can be complicated with large swarms. Within the study presented in this paper, the proposal is to apply a centralized swarm due to the robustness of UAV coordination and the efficient mission control that allows fusing all data at a single point. However, future work could explore the use of a decentralized approach to enable a more autonomous system capable of handling multiple tasks performed by different UAVs, and to increase problem scalability, allowing large quantities of UAVs to be introduced in complex scenarios. This approach would especially help in situations with high-density traffic, in which the limitations of fewer UAVs are more apparent.
The centralized control of the swarm is facilitated through a ground control station module, which is responsible for task allocation to each individual UAV. For the proposed problem, each UAV’s tasks include positioning at a desired location and an image recovery process. For this, the control station must consider all UAVs’ positions and movements to avoid collisions and achieve a safe mission. In addition, the control station can also apply tracking and data fusion algorithms to obtain an improved output for the surveillance problem.
Figure 2 presents an overview of the mission to be performed for the surveillance problem: the control station should first launch the control and communication threads for each UAV, then plan the positions at which each UAV introduced in the swarm must be placed, and assign the corresponding movement mission to the different UAVs. These movement missions must be configured to avoid collisions between different UAVs and should take the current scenario environment into account. It is important to note that the same mission can be modified depending on the number of UAVs to be used, selecting different positions to acquire extended coverage of the monitored area.
Once the mission is started, each UAV will autonomously undertake the designated movements, starting with the takeoff and navigation toward a predetermined target position. Upon reaching this position, the UAV enters a standby state, awaiting directives from the ground control station to initiate road surveillance.
Once all UAVs are positioned, the control station can command the start of the data acquisition phase, initiating the road surveillance. During this phase, each UAV orients its camera towards the specified measuring position and captures images at a pre-defined frequency.
The information obtained by each UAV consists of the captured images alongside metadata such as the capture time, the UAV’s spatial coordinates, and the camera’s yaw, pitch, and roll parameters. With this information, the tracking system is able to detect vehicles in the images and assign precise positions to them. By continuously tracking the vehicles across multiple consecutive frames, the system can identify and follow their movements along the road.
In addition, the use of data fusion techniques on the individual tracking outputs allows the system to generate a consolidated trajectory for each detected vehicle. The quality of these trajectories is expected to improve with the inclusion of additional UAVs in the system, as the increased data from the swarm enhances the accuracy and overall performance of the surveillance task.
The resulting trajectories furnish essential information for the proposed surveillance problem. In future work, a decision-making module can be implemented to take advantage of this information and assign new tasks to the UAVs. For example, it is possible to send part of the UAV swarm to new positions or to start following a specific vehicle for a more complex surveillance mission. In addition, with the current output, it is possible to send computed information to a human operator that can take action depending on the surveillance results.
In addition to this decision-making step, the ground control station should also finish the mission. To do so, once the road surveillance task is accomplished, the control station will start the ending process, sending a new movement mission for each UAV to return to the specified home location and land.
UAV Control and Simulation
For controlling the actions of each UAV, the proposed system uses the MAVSDK control software, version 2.5. This tool serves as the control station for the system, facilitating communication between the PX4 flight control software (version 1.14) and the AirSim simulation environment (version 1.8.1).
Renowned for its widespread adoption in UAV control applications, MAVSDK is aligned with the PX4 software and can be used inside on-board computers, thereby enabling fully autonomous UAV missions. Consequently, it stands as an optimal choice for the proposed experimentation, with potential expansions envisaged in future HITL or VITL implementations.
This software enables the operation of UAVs by executing command and control operations, as well as obtaining mission parameters such as UAV waypoints, movement control directives, and mission duration specifications. Additionally, it allows for configuring the image capture process, including setting the desired capture frequency and saving essential data for the surveillance task, such as images and metadata, including the image capture locations.
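A minimal sketch of the positioning phase using MAVSDK-Python is shown below; the connection address, coordinates, and fixed takeoff wait are illustrative assumptions, and telemetry checks and error handling are omitted.

```python
# Sketch of the positioning phase with MAVSDK-Python.
import asyncio
from mavsdk import System

async def position_uav(url: str, lat: float, lon: float, alt_amsl: float):
    drone = System()
    await drone.connect(system_address=url)

    # Wait until the UAV reports a valid connection.
    async for state in drone.core.connection_state():
        if state.is_connected:
            break

    await drone.action.arm()
    await drone.action.takeoff()
    await asyncio.sleep(10)  # crude wait for the takeoff to complete

    # Move to the assigned surveillance position and hold there.
    await drone.action.goto_location(lat, lon, alt_amsl, 0.0)

# One control task per swarm member, each PX4 instance on its own port.
asyncio.run(position_uav("udp://:14540", 47.3977, 8.5456, 520.0))
```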
Using real-world control software for the UAV and the ground control station, the proposed simulation environment is configured to ensure a level of realism comparable to real-world scenarios. This allows for safe and secure SITL testing of all the implementations proposed within the purview of this study. However, it is important to take into account that the simulation does not include all real-world elements; future work will need to evaluate the problem with new tests using HITL and VITL approaches to address problems not considered in this initial approach.
To provide an overview of the implemented simulation architecture, Figure 3 presents the different modules within the system and the links existing between them. The simulation computer serves as the focal point and encompasses the configuration of the AirSim framework as the UAV simulator, alongside the Unreal Engine serving as the 3D engine for physics and image rendering. Given the initial design as a SITL system, the PX4 software is also integrated into this computer. However, for future iterations, the flight controller software module will need to be transitioned to an external hardware flight controller.
It is important to note that the ground control station is presented as a separate module from the simulation; however, both subsystems can be implemented on the same computer for Software-In-The-Loop (SITL) testing. Alternatively, the control station software can also operate on a different connected hardware platform.
Within this ground control station, the mission implementation is based on MAVSDK. The mission, as previously described, encompasses the positioning and data acquisition processes. In addition, the control station can operate the tracking module using the outputs provided by the simulated sensors. This tracking module can be applied as a real-time or offline solution, depending on the immediate or delayed processing of the images. The proposed surveillance problem should be applied as real-time processing to perform road monitoring. For future improvements to the system, it is possible to include a decision-making process to modify the UAV mission by assigning new tasks to be performed in response to the surveillance output. More examples of the implemented code for UAV navigation based on the PX4 architecture are shown in the SIMBAT project, with proofs of concept of UAV-based solutions in related areas [81].
3.2. Vehicle Detection
The literature on vehicle detection based on computer vision presents two primary approaches: employing a single algorithm to handle both detection and identification in a unified stage, or utilizing multiple algorithms to separate detection and identification. The use of different stages has the objective of improving the accuracy of the detection, although it implies additional computation time to perform the different phases.
The system presented in this paper can accommodate either approach, as the detection algorithm can function separately from the tracking module, allowing for the integration of various algorithms within the detection and tracking module. However, the proposal for this study is to demonstrate how a faster algorithm can achieve results equivalent to those of a more precise algorithm through the use of multiple UAVs and a fusion system.
For this study, the YOLO algorithm has been selected as an example of a one-stage algorithm. YOLO models are recognized as widely adopted and extensively tested algorithms in the literature, making them among the most commonly used solutions for various computer vision problems, particularly in detection and tracking applications. In relation to the proposed problem, these models have demonstrated commendable performance across diverse domains, with the primary advantage of short processing times for the tasks at hand. While the algorithm primarily focuses on object detection and classification, YOLO models also exhibit proficiency in segmentation. The API provided with version 8 of this algorithm (YOLOv8) includes connectivity to other applications, such as segmentation models and multi-object tracking, making it a useful option for enhancing this study with additional improvements to the overall tracking system.
Using the API provided with YOLO, it is possible to easily link a segmentation model that transforms the one-stage implementation of the detection model into a two-stage configuration. SAM is an emerging algorithm that presents a robust segmentation approach without requiring specific training for class recognition, but it has the disadvantage of needing a second-stage algorithm to perform the identification step, classifying all the vehicles detected in an image. By using this model alongside the YOLO classification options, it is possible to generate a robust two-stage implementation against which the one-stage proposal is compared.
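As a sketch, and assuming the Ultralytics Python API with standard pre-trained weights, the two setups can be instantiated as follows; the weight files and image path are placeholders, not the exact models used in the study.

```python
# Minimal sketch of the one-stage vs. two-stage setup via the Ultralytics API.
from ultralytics import YOLO, SAM

image = "uav_frame.jpg"  # assumed path to a UAV-captured frame

# One-stage: YOLOv8 detects and classifies vehicles in a single pass.
detector = YOLO("yolov8m.pt")
det_results = detector(image)

# Two-stage: SAM produces class-agnostic segments of the same frame;
# a second step is then required to label each segment as a vehicle.
segmenter = SAM("sam_b.pt")
seg_results = segmenter(image)

print(det_results[0].boxes)   # boxes, classes, confidences (one stage)
print(seg_results[0].masks)   # segmentation masks only: no classes yet
```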
SAM has showcased efficacy in various problem domains related to this study, being able to perform the segmentation of moving vehicles and low-contrast scenes. It is also adapted to real-world scenes and is even able to use different image sensors, such as thermal infrared [44]. With all these characteristics, alongside the easy connection to the YOLO framework, SAM seems suitable for integration into the proposed car detection system. Its main disadvantage is the additional computation time required, not only for the specific segmentation step but for the entire detection and identification process.
In addition, to evaluate the obtained results, a second one-stage detection algorithm is proposed. RT-DETR is a real-time detection algorithm based on a transformer vision architecture. It has shown high accuracy while maintaining real-time capabilities on different object detection problems [24,82].
3.3. Vehicle Tracking
After the detection process, the tracking system is applied to generate trajectories for the different detected ground vehicles. The main goal is to generate and maintain vehicle IDs across multiple detections, facilitating trajectory generation for the decision-making processes to be applied in future works of this line of investigation. In addition, the tracking system can provide feedback to the detection system, allowing a refinement and improvement of future detections.
To enhance the detection algorithms, the tracking system employs the already computed trajectories to perform predictions and determine possible positions of the vehicles. This strategy aims to narrow the detection area, adding a second chance of reprocessing in case of missed detections.
To transform detections into waypoints usable by the tracking process, a georeferencing step must be applied. This algorithm is in charge of computing the geodetic coordinates of pixels in the image, assigning a position to each detected vehicle. In addition, with multiple sources of information, it is important to consider all of them when performing the tracking, so a fusion system is needed to join the information coming from each UAV in the swarm.
Figure 4 presents an overview of the detection and tracking process. The process is applied for each image generated by a UAV. Initially, a pre-trained detection algorithm is applied to the entire image.
If the algorithm fails to detect any vehicle but previous predictions indicate the presence of a vehicle, a reduced area of the image is computed in the zone of the prediction. This reduced area allows the detection algorithms to be applied to a smaller set of pixels, which can recover previously missed detections, as the prediction estimates a vehicle in that area.
In cases where no detection occurs despite previous predictions, the prediction is maintained for future detections, but no vehicle trajectory information is generated, to ensure tracking fidelity. This means that only confirmed detections will be used for the trajectory information to be used in the future.
Once a vehicle is detected, georeferencing is applied to generate an associated position for the detection. Using this position, the trajectory data fusion process will either associate the detection with an existing trajectory or create a new trajectory. The predictions generated by the data fusion system will be utilized in future iterations of the process.
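The following sketch summarizes this per-image flow; the detect, georeference, and fuse callables, as well as the crop size, are illustrative stand-ins for the modules described in this section, not the paper's code.

```python
# Sketch of the per-image detection loop with prediction-based reprocessing.
def crop_around(image, pred, size=200):
    # Hypothetical helper: crop a window centred on the predicted vehicle
    # position projected into pixel coordinates (u, v); numpy-style slicing.
    u, v = int(pred[0]), int(pred[1])
    return image[max(0, v - size):v + size, max(0, u - size):u + size]

def process_frame(image, predictions, detect, georeference, fuse):
    detections = detect(image)

    if not detections and predictions:
        # Second chance: re-run the detector on a reduced area around
        # each predicted position, where a vehicle is expected.
        for pred in predictions:
            detections += detect(crop_around(image, pred))

    if not detections:
        # Keep the predictions alive for future frames, but generate no
        # trajectory information: only confirmed detections are used.
        return

    for det in detections:
        position = georeference(det)
        fuse(position)  # extend an existing trajectory or open a new one
```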
Note that the system can have multiple detection and tracking modules working at the same time, each one processing the images captured by one UAV in the system. The only common element, the trajectory dataset, is shared and updated by each working tracking module. This means that the prediction of the future positions of a trajectory is performed by the trajectory data fusion module, using a more precise position of the global trajectory that combines all UAV measurements.
3.3.1. Georeferencing
To compute a position from the captured photographs, it is necessary to apply a georeferencing step, computing the geodetic coordinates of the pixels from the image and consequently to the detected vehicles. In a real-world problem, this process relies on camera calibration and needs metadata associated with the captured photographs, including the following:
The position of the camera at the time of capturing the image, which can be calculated from the UAV position;
The specific point in space the camera was directed towards, which can be computed from the roll, pitch, and yaw angles of the camera;
The Field of View (FOV) of the camera, in its horizontal ($FOV_h$) and vertical ($FOV_v$) dimensions.
Within the simulation proposed for this study, all this metadata can be easily obtained from the UAV simulator and the 3D engine. It is even possible to configure cameras with specific characteristics for the problem. Assuming a camera situated at (0,0,0) and pointing at a flat surface from above, the pinhole camera model detailed in [1] allows for the transformation between pixel coordinates $(u, v, w)$ and real-world coordinates $(x, y, z)$. With image center $(u_c, v_c)$ and the camera at distance $z$ from the flat surface, the back-projection takes the form

$$x = z \tan\!\left(\frac{FOV_h}{2}\right)\frac{u - u_c}{u_c}, \qquad y = z \tan\!\left(\frac{FOV_v}{2}\right)\frac{v - v_c}{v_c}.$$

This means that it is possible to estimate the real-world position using the camera FOV and the image center coordinates $(u_c, v_c)$. Then, it is possible to apply translations using the camera position and trigonometric transformations using the roll, pitch, and yaw angles of the camera [1].
Note that to solve this problem, the calibration process of the camera is an essential step, although the simulation allows for an easier approach by providing the intrinsic parameters of the camera.
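An illustrative sketch of this step, under the same assumptions as the equations above (downward-looking camera over a flat surface), is given below; the function and parameter names are ours, not the paper's implementation.

```python
# Back-project a pixel to ground-plane offsets under the pinhole model.
import math

def pixel_to_ground(u, v, width, height, fov_h, fov_v, cam_alt):
    """Return ground-plane offsets (x, y) in metres for pixel (u, v).

    fov_h / fov_v are the horizontal and vertical FOV in radians, and
    cam_alt is the camera height above the flat surface.
    """
    u_c, v_c = width / 2.0, height / 2.0  # image center coordinates
    x = cam_alt * math.tan(fov_h / 2.0) * (u - u_c) / u_c
    y = cam_alt * math.tan(fov_v / 2.0) * (v - v_c) / v_c
    return x, y

# The resulting offsets are then rotated by the camera's roll/pitch/yaw and
# translated by the UAV position to obtain geodetic coordinates.
```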
3.3.2. Trajectory Data Fusion
The trajectory data fusion process consists of two steps, as illustrated in Figure 5. The elements highlighted in white apply individually to each UAV, while the elements highlighted in gray are common to all UAVs in the system. Additionally, the system is linked to the UAV control system to request new images from the UAVs, which will be incorporated into the vehicle detection process, thereby initiating the loop once again.
Once a new position coming from the georeferencing is introduced into the vehicle tracking process, the data fusion compares it, for each of the current trajectories in the system, with the predicted positions using the Euclidean distance function to decide whether it belongs to a previously generated trajectory.
When the computed distance is under a defined threshold, the prediction and the detection are considered to belong to the same trajectory. In that case, the new position is added to the previously existing track. Otherwise, as no previous prediction matches the new detection, a new track is generated for future detections.
To associate the positions computed on each UAV into a single trajectory, the proposal highlighted in gray inside Figure 5 involves an implementation based on the Munkres algorithm. With this proposal, it is possible to select the optimal global trajectory to be associated with each new local detection.
The local vehicle detection process applied for each UAV image assigns local IDs for the detected cars, and then the global association process will attempt to assign each local ID to a corresponding global ID, selecting the most suitable option from the list of global trajectories.
To perform the comparison between global trajectories, a prediction is made for each trajectory. Then, the positions detected for the local trajectories of each UAV are compared using the Euclidean distance, assigning the new detection to the nearest global trajectory when the computed distance between the two positions falls within a predefined threshold. Note that it is possible to use not only the computed position but also kinematics such as velocity or acceleration to refine this association process.
The procedure iterates over all UAVs and vehicles until all local IDs are effectively merged into existing trajectories or new trajectories are created for the unassociated local positions. The defined threshold serves as a criterion for determining the minimum proximity required to establish a meaningful association. Through this iterative process, the proposed methodology aims to achieve a comprehensive and accurate alignment of trajectories, enhancing the overall efficacy of the system for vehicle tracking and monitoring.
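A minimal sketch of this gated association, using SciPy's implementation of the Munkres (Hungarian) algorithm and an assumed distance threshold, could look as follows.

```python
# Gated local-to-global association via the Munkres (Hungarian) algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(local_positions, global_predictions, threshold=5.0):
    """Match local detections (m x 2) to predicted global positions (n x 2)."""
    # Euclidean distance matrix: locals x globals.
    cost = np.linalg.norm(
        local_positions[:, None, :] - global_predictions[None, :, :], axis=-1
    )

    rows, cols = linear_sum_assignment(cost)

    matches, unmatched = [], []
    for r, c in zip(rows, cols):
        if cost[r, c] <= threshold:
            matches.append((r, c))   # local r extends global trajectory c
        else:
            unmatched.append(r)      # too far: will start a new trajectory
    unmatched += [r for r in range(len(local_positions)) if r not in rows]
    return matches, unmatched
```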
Once the association is applied, the position of each global trajectory waypoint is defined as the centroid of all the local positions assigned to the corresponding trajectory.
With the global trajectory, it is possible to perform a prediction for the vehicle position in the future by using the previous waypoints. To do this, it is necessary to take the position at two different timestamps and compute the vehicle velocity between the two points. Having the velocity and the last position makes it possible to compute an estimated future position.
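In our notation, this constant-velocity prediction can be written as

$$\mathbf{v} = \frac{\mathbf{p}(t_2) - \mathbf{p}(t_1)}{t_2 - t_1}, \qquad \hat{\mathbf{p}}(t_2 + \Delta t) = \mathbf{p}(t_2) + \mathbf{v}\,\Delta t,$$

where $\mathbf{p}(t_1)$ and $\mathbf{p}(t_2)$ are two timestamped waypoints of the global trajectory.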
Thanks to the use of the data fusion system, it is possible to optimize future detections by taking into account the relative accuracy of the inputs from swarm members. By integrating data from all the UAVs, data fusion reduces uncertainties and compensates for the limitations of individual perspectives within the swarm, such as occlusions or blind spots. The system aligns and synchronizes data from each source, filtering out inconsistencies and enhancing vehicle detection. This leads to improved detection rates, fewer false positives, and better handling of complex environments, ultimately resulting in a more robust and effective surveillance solution.
An important factor to take into account is that the scalability of the data fusion system can be limited when adding large quantities of UAVs, and the association problem can become complex when a large number of trajectories is maintained, especially in scenarios with high-density traffic. To mitigate this problem, high-end hardware capable of multithreading the data fusion process into multiple parallel processes can be used. Alternatively, cloud computing is also an option, although it is then necessary to ensure a high-quality internet connection in the deployment scenario to guarantee real-time processing.
3.4. Real-World Challenges
When deploying the swarm in a real-world environment, several challenges can appear, affecting the swarm behavior at the technical, environmental, and operational levels [55,83]. It is important to take into account that the current system implements only a SITL simulation, which provides valuable insights into the problem. However, future HITL and VITL implementations will have to consider the real-world problems that can appear in more detailed systems.
First, when dealing with a real-world implementation, it is essential to take into account the constraints that can appear in the communication and coordination of the swarm members. The proposed system is a centralized swarm that depends on a centralized control station, requiring a robust connection to perform real-time data sharing and UAV coordination. In the real world, UAVs may face limited bandwidth or intermittent connectivity, introducing latency or packet loss that will impact the swarm performance. In addition, the terrain may include physical obstructions that can produce signal interference.
To solve these problems, a real-world implementation must ensure a highly reliable connection. It may even be necessary to consider the implementation of a decentralized swarm to increase the scalability of the communication protocols and the adaptability to different communication conditions.
The terrain challenges not only the communication protocols but also the navigation of the UAVs, making it necessary to consider obstacle avoidance processes that ensure a safe flight and localization protocols that do not rely solely on GPS, as in certain terrains the signal can be weak or inconsistent. In addition, harsh weather conditions (strong winds, rain, fog, low lighting) can affect the swarm’s flight capabilities and even prevent mission execution, resulting in the need for robust sensory and control mechanisms that ensure safe mission development.
To mitigate the communication constraints, a viable option is to include redundancies that ensure message delivery, although it is important to note that this approach can saturate the communication network, adding more constraints to the problem’s scalability. In SITL simulation, all communications are reliable and correctly applied; it is possible to include programs that simulate these constraints, although it is better to test them in HITL, where the real hardware is used.
Other factors to consider, not only for swarm flights but also for single UAVs, are the limited flight time due to battery capacity and the security and privacy that must be ensured by any UAV mission, especially a video-based one. The battery limitation can be simulated in SITL by adding time limits to the mission flight.
Finally, the possibility of failures such as mechanical issues implies the need for fail-safe mechanisms that ensure the execution of the mission. Including redundancy and the ability to reassign tasks and redistribute the area coverage can ensure resilience within the swarm. This feature is also easily implementable in SITL simulation, making it possible to disconnect processes and systems during mission development to evaluate how the system reacts.
Addressing these real-world challenges is crucial for the reliable and effective deployment of UAV swarms in video surveillance. Continued research and technological advancement in communication protocols, navigation systems, energy solutions, data processing techniques, and privacy-preserving methods will be essential in overcoming these obstacles and enabling robust UAV swarm surveillance in complex, real-world environments.
4. Test and Results
The proposed implementation affords flexibility for mission configuration and the testing of various proposals within the system. It enables the modification of multiple parameters within the simulation to generate diverse test scenarios and evaluate the system under different conditions. This encompasses configurations both within the proposed detection system and the simulation environment:
Detection Algorithm: The system allows for the application of different algorithms for vehicle detection and tracking. It facilitates the acquisition of data for both online and offline approaches to system operation;
UAV Configuration: The detection system can be adapted to accommodate different numbers of UAVs. This includes not only adjusting the quantity of UAVs but also selecting appropriate models and configuring onboard sensors;
Mission Configuration: The system allows for the customization of missions assigned to each UAV. This includes defining surveillance positions for each UAV within the simulation. Furthermore, potential future configurations may involve specifying different actions for individual UAVs, such as repositioning and target-following processes;
Scenario Configuration: Beyond the confines of the proposed system, the simulation framework enables the configuration of scenarios by modifying the simulation environment and agents. This encompasses the Unreal Engine map used, including the desired road map, and defining the vehicles to be detected by the UAVs.
From these configurations, the evaluation presented in this study considers two main proposals. First, it is important to select a detection algorithm that can perform the surveillance tasks as desired. Second, it is important to evaluate the selected algorithm working within the swarm system.
For the first evaluation, several configurations of the chosen YOLO algorithm have been implemented and applied through the detection system. These configurations have been tested in different scenarios that represent possible real-world situations.
For the second evaluation, the best configuration of the YOLO algorithm has been selected and inserted into specific situations in which the surveillance output is evaluated with 1 UAV, a swarm of 2 UAVs, and a swarm of 5 UAVs.
For the proposed vehicle surveillance problem, two primary metrics should be prioritized to facilitate the evaluation of the proposed solution and the selection of the most suitable detection algorithm for the diverse simulated scenarios. First, it is necessary to consider the accuracy of the vehicle detection and tracking system. Second, it is necessary to consider the computation time of the applied technique.
In the case of the accuracy metric, it is important to assess the precision of each vehicle detection to evaluate the entire surveillance problem. The goal of this measure is to evaluate the different detection rates among algorithms across all scenarios, providing insights into their efficacy and reliability. The proposal for this study is to use a metric based on the classical MOTA approach. As the target of this study is to build a surveillance system composed of multiple UAVs, the proposed metric is based on the successful detections made by each UAV and the number of UAVs in the swarm.
The average tracking success (ATS) considers the number of successful detections against the total number of detections, where the total includes both successful and unsuccessful detections.
This metric has been implemented to consider both false positives and false negatives, allowing for an assessment of the system’s precision in identifying true vehicle locations and its effectiveness in minimizing errors. A detection is only considered “successful” when the system’s output aligns with the ground truth: either by correctly identifying a vehicle near its georeferenced position or by correctly identifying the absence of a vehicle in an area. A higher ratio indicates stronger alignment with the ground truth data, reflecting greater tracking accuracy and robustness in diverse scenarios.
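Expressed as a ratio, in our notation and following the definition above, this is

$$ATS = \frac{N_{\text{successful}}}{N_{\text{successful}} + N_{\text{unsuccessful}}}.$$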
For a detection to be considered successful, the ground truth of the simulation must be compared with the surveillance output. The following cases appear (Table 1):
The computation time of the deployed algorithm is a significant metric for the proposed problem, as real-time scenarios are one of the main applications of the system; it is especially relevant where the expeditious detection of potential threats or hazards is paramount. The ability to promptly detect and respond to such incidents hinges on the efficiency of the detection algorithm in processing incoming data and generating reactions without delay.
To compute this metric, a timestamp is registered when a new image enters the system, and a new one is registered when the surveillance system assigns a position to the detected vehicle. Averaging all timestamp differences makes it possible to compute an average processing time (APT) in seconds.
To relate both metrics, it is possible to consider the ATS obtained per unit of processing time by computing the ATS/APT ratio.
Although not critical in a SITL scenario, to ensure the comparability of the different experiments, all scenarios are executed on the same hardware under the same conditions. The hardware used is a high-end computer that includes an AMD Ryzen 9 3900X processor, an RTX 2060 6 GB GDDR6 graphics card, and 64 GB of DDR4 RAM. With this hardware, it is possible to multithread the software so that the subprocesses of the different swarm members are computed at the same time.
4.1. Detection Algorithm Evaluation
The proposed approach for system testing involves evaluating different proposals for the detection algorithms and comparing their respective outcomes within a defined set of scenarios. These proposals will encompass diverse detection models for the algorithms and various configurations for the overall system. This evaluation process aims to discern the efficacy and performance of the detection and tracking system under different conditions. The following scenarios have been proposed:
Scenario 1: Using a single UAV to perform the detection of two ground vehicles moving in opposing directions. With this scenario, it is possible to test the performance of the system with a single UAV. In addition, it is possible to test classical computer vision problems like the occlusions that will happen when the two ground vehicles cross paths;
Scenario 2: Using three UAVs to perform the detection of two ground vehicles moving in opposing directions. This is an expansion of Scenario 1, allowing for a comparison of the performance of the system when more UAVs are used in the detection (see Figure 6 for examples of Scenarios 1 and 2);
Scenario 3: Using three UAVs to perform the detection of a single ground vehicle. This is a basic scenario to test the system in an ideal scenario without any complications for the detection;
Scenario 4: Using three UAVs to perform the detection of a single ground vehicle performing complex maneuvers like the one presented in Figure 7. This includes a roundabout with multiple image angles for the detected vehicle, as well as crossing and overlapping problems. This is an expansion of Scenario 3, but in this case, complex maneuvers are introduced in the path of the ground vehicle, allowing the detection of those complex maneuvers to be tested;
Scenario 5: Using three UAVs to perform the detection of three ground vehicles. This allows experimentation on the tracking system and makes it possible to study how the trajectories are generated from multiple sources and how the system can handle multiple vehicles, including specifically generated problems like occlusions in computer vision;
Scenario 6: Using three UAVs inside a realistic simulation scenario with multiple vehicles. This allows for the full SITL scenario, introducing multiple vehicles at the same time and evaluating whether the tracking system can compute all trajectories without problems (see Figure 8).
All the proposed scenarios are developed to simulate real-world conditions like changes in perspectives, distance to targets, occlusions, and ground vehicle maneuverability. With these representations, the objective is to discern the efficacy and performance of the detection and tracking system by configuring swarms working with different algorithm implementations.
The initial evaluation focuses on the use of one-stage versus two-stage detection models. Specifically, YOLO (You Only Look Once) and SAM (Segment Anything Model) have been compared. Both models in their best-tested configurations exhibit strong performance in the detection stage in most experiments, with SAM demonstrating notable proficiency in detecting vehicles regardless of their distance.
However, the drawback of SAM is the substantially increased computation time compared to the YOLO model. This comparison is illustrated in the table below, which shows both models’ performance under identical detection conditions. The scenarios considered include capturing a full image by the UAV and a region-of-interest scenario where the detection area is reduced based on the prediction algorithm.
As can be seen in Table 2, the generation of masks by SAM takes a considerable amount of time, making it impossible to perform real-time detection. This situation is aggravated further when taking into account that the vehicles segmented by SAM are not classified, so a second stage is needed to identify them.
While SAM demonstrated significant potential for the given problem, the performance penalty of using this algorithm is a fundamental consideration for the implementation of the proposed system in a real-world application. For the moment, this makes SAM unsuitable for real-time problems that require fast processing of the UAV-captured images.
In general, the performance penalty of two-stage detection models poses a critical challenge for real-world applications that require fast processing of UAV-captured images, making them less desirable for real-time problems.
This does not mean that these algorithms are not useful; it means only that they should be reserved for offline scenarios in which the processing time is not relevant. On the other hand, the expeditious nature of YOLO and its good detection accuracy translate to reduced detection time and enhanced real-time tracking capabilities, positioning one-stage algorithms with similar characteristics as a more pragmatic tool for decision-making and road surveillance in UAV applications.
The following table presents an in-depth examination of the different YOLO architectures tested for this study. The table showcases results for the following YOLOv8 architectures trained for detection: nano, small, medium, large, and extra-large [84].
In addition, to consider the possibility of a segmentation model that does not involve a high computation time, the option of using YOLOv8 medium alongside a lightweight segmentation step [85] is included. With this model, the target is to evaluate whether the segmentation adds a significant improvement to the YOLO model, making it useful for the proposed solution.
The results displayed in Table 3 highlight two primary metrics: average processing time and average tracking success. Average processing time denotes the time required, on average, for the algorithm to execute the designated task for each image captured by a UAV during the experiment. This metric serves to evaluate the algorithm’s efficiency in producing results within the system and allows for the selection of the best implementation for a real-time tracking system. Conversely, average tracking success assesses the algorithm’s performance when integrated into the tracking system, reflecting the accuracy of the vehicle detection and tracking functionalities of the system. For both metrics, the best results are highlighted in bold, while the worst results are indicated in gray.
Analysis of the computed metrics reveals the expected trend: larger architectures incur longer average processing times. In all scenarios, YOLO nano presents the shortest time, while YOLO extra-large presents the longest. The only exception to this trend is in scenario 1, where image segmentation imposes additional computation time on the YOLO medium model trained for segmentation. Furthermore, the use of architectures trained for segmentation does not yield satisfactory results for this problem domain. In contrast, the YOLOv8 medium architecture trained for detection demonstrates superior performance in terms of average tracking success, showing more effectiveness for vehicle detection tasks.
Regarding the tracking success, the analysis of the scenario 1 results indicates that larger architectures yield superior performance, with the extra-large architecture achieving a tracking success rate five times higher than the other architectures. However, this trend does not persist when additional UAVs are introduced into the system. Subsequent experiments that include more UAVs reveal that other architectures can achieve results equivalent to YOLO extra-large; moreover, YOLO large outperforms the extra-large architecture in most cases.
Also, when considering the relation between both metrics through the ATS/APT ratio, the YOLO small architecture outperforms the other architectures in most scenarios, with YOLO medium as the next best candidate. This means that the improvement in the ATS metric offered by larger architectures is not as large as the improvement in processing time offered by the smaller ones.
Consequently, the direct inference drawn from these findings is that augmenting the number of surveillance UAVs holds greater significance than solely relying on the best tracking algorithm: the data fusion of multiple sensors in the system provides more available information and, consequently, better results in terms of tracking success. Nevertheless, it is important to consider that in scenarios with a single UAV, optimal results are achieved with larger architectures.
Taking into account the ATS, in scenarios involving multiple UAVs, YOLO v8 large and medium architectures emerge as the most favorable options. Generally, YOLO large is preferred for its accuracy, whereas YOLO medium excels in terms of computation time. Notably, while YOLO medium yields comparable results in simpler scenarios with a single vehicle (scenarios 1, 3, and 4), YOLO large architecture proves significantly more effective in complex scenarios featuring multiple vehicles.
However, when the APT is introduced into consideration, YOLO medium outperforms YOLO large, and YOLO small has the best overall results. The conclusion depends on the final use of the surveillance system: if reduced processing time is essential, YOLO small provides the best results, but with a minor increase in processing time, the bigger architectures can provide better results.
When looking specifically at the segmentation results (see Figure 9), it is possible to see that, in terms of processing time, the implementation without segmentation is better than the implementation with segmentation. In terms of tracking success, the segmentation implementation has less variability in its results, but, in general, omitting the segmentation produces better results on this problem. This effect appears due to errors arising in the segmentation step (lighting, occlusions, or background clutter) that propagate to the detection algorithm, especially in scenarios with multiple vehicles of varying sizes and orientations that are not well separated by the algorithm.
These results show how a lightweight segmentation model, though it can be beneficial on some occasions, can produce worse results on others, especially in scenarios with multiple vehicles and background objects. SAM and other advanced segmentation models can produce better results but, as already discussed, their high computational demands make them challenging for real-time application.
Refining advanced models with pruning techniques can produce similar results in less time. Another option is to build an ensemble method into the fusion system, combining SAM with a fast detection algorithm. In this approach, the fast detection algorithm is used by default, but a decision rule triggers SAM when the detection algorithm fails. For example, the bounding box confidence can be evaluated, or SAM can be applied only when a tracked vehicle is lost.
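A sketch of such a decision rule is given below; the wrapper functions and the confidence threshold are illustrative assumptions, not the paper's implementation.

```python
# Ensemble decision rule: fast one-stage detector by default, SAM fallback.
CONF_THRESHOLD = 0.5  # assumed bounding box confidence gate

def detect_with_fallback(image, yolo_detect, sam_detect, tracked_lost=False):
    boxes = yolo_detect(image)  # fast pass, list of (box, confidence) pairs

    low_confidence = not boxes or max(c for _, c in boxes) < CONF_THRESHOLD
    if low_confidence or tracked_lost:
        # Pay the segmentation cost only when the cheap detector is
        # unreliable or a tracked vehicle has been lost.
        return sam_detect(image)
    return boxes
```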
Also, by including different UAVs in the system, it is possible to differentiate the tracking tasks: UAVs with the slower approach that uses the advanced segmentation model sacrifice real-time capabilities, while UAVs dedicated to fast detection maintain real-time surveillance.
4.2. Swarm-Based Surveillance System Evaluation
To evaluate the use of swarms in the surveillance problem, representative real-world scenarios are assessed with varying numbers of UAVs, considering how an increase in UAVs can enhance surveillance effectiveness.
The first scenario involves the detection of a single vehicle, serving as a baseline example of how the inclusion of additional UAVs can improve detection probabilities and overall surveillance output. The second scenario follows the same vehicle into a roundabout, where the detection perspectives of the UAVs change significantly as the vehicle executes a turning maneuver.
Subsequently, additional ground vehicles are introduced to create an occlusion scenario, in which the target vehicle is completely obscured by another vehicle. This scenario resembles the first two but differs in that the vehicles move in the same direction at different speeds, thereby prolonging the occlusion duration.
This occlusion restricts the detection capabilities of the UAV’s camera sensor, which can be mitigated through the use of additional information sources. Furthermore, the difficulty of this scenario can be increased by subsequently adding more vehicles within the UAV’s detection area, including different vehicle models to represent a real-world detection setting. This allows the surveillance of several targets to be tested simultaneously, encompassing situations involving vehicles moving in both the same and different directions, as well as multiple occlusions throughout the simulation.
These traffic scenarios, depicted in Figure 10, reproduce the same situation with an increasing number of vehicles, allowing the evaluation of how multiple ground vehicles can degrade the performance of the surveillance solution.
Finally, to simulate an additional real-world condition, experiments are conducted in low-light environments to evaluate the applicability of the proposed solution in nighttime scenarios.
For the results presented in Table 4, the first conclusion is that, as expected, the swarm outperforms the single UAV in all scenarios, and increasing the number of swarm members improves the detection, although the results for three and five UAVs are similar. Note that the reported processing time is averaged across all swarm members; the total time would therefore be larger.
The occlusion situation provides an interesting result: the single UAV performs poorly, as the ground vehicle is completely lost from its perspective, while the swarms operate better by having multiple perspectives.
Table 5 presents the results of a single member of the swarm to illustrate this situation. As can be seen, these results are considerably better than those of the single-UAV mission; this is because the single UAV was deliberately placed in an unfavorable position, compared to this swarm member, in order to force the occlusion scenario.
The same applies to the swarm of five UAVs being outperformed by the swarm of three UAVs: adding UAVs with poor perspectives yields worse results than a smaller swarm whose members are located in the best positions.
This means that the number of UAVs is not the only element to consider; the placement of the UAVs within the mission is an equally relevant factor when configuring the mission.
In the complex scenarios, another relevant conclusion is that the low-lighting conditions have little effect on the surveillance output, producing, as expected, worse results than the base scenario but with an overall good performance. However, including more ground vehicles in the simulation has a more noticeable effect. This problem appears due to additional false-positive detections that arise when a vehicle is detected by different UAVs but is not associated with a single global trajectory.
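To illustrate where these false positives originate, the following hedged sketch associates ground-projected detections from one UAV with the global trajectories using the Hungarian algorithm and a gating distance; detections left unmatched are the candidates that may spawn spurious global tracks. This is not the exact fusion logic of the system, and the gating distance is an assumed value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 5.0  # assumed gating distance in meters; detections beyond this
            # radius stay unmatched and may spawn spurious global tracks

def associate(global_tracks, detections):
    """Match ground-plane detections from one UAV to global trajectories.

    global_tracks, detections: numpy arrays of shape (N, 2) and (M, 2) with
    (x, y) ground coordinates. Returns matched (track, detection) index
    pairs and the indices of unmatched detections, i.e., the source of
    duplicate or false-positive trajectories.
    """
    if len(global_tracks) == 0 or len(detections) == 0:
        return [], list(range(len(detections)))

    # Pairwise Euclidean distances between tracks and detections.
    cost = np.linalg.norm(
        global_tracks[:, None, :] - detections[None, :, :], axis=2
    )
    rows, cols = linear_sum_assignment(cost)

    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= GATE]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(detections)) if c not in matched_dets]
    return matches, unmatched
```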
To illustrate these results, Figure 11 shows a heatmap of the ATS metric across all scenarios for the tested swarms of one, three, and five UAVs. As can be seen, the proposed swarm system maintains its ATS performance under all the conditions proposed in the scenarios, making it possible to appreciate the benefit of using more UAVs to perform surveillance tracking.
To compare against the YOLO algorithm, the RT-DETR algorithm is proposed as an alternative one-stage model capable of obtaining high-accuracy results in real-time processing. Table 6 shows the same swarm experiments as with the YOLO algorithm, but with the implementation modified to use the RT-DETR algorithm.
With reference to the RT-DETR results, the conclusions are similar: the inclusion of additional UAVs improves the surveillance output, but the UAV perspective remains an important factor. The main advantage of using swarms is still the simultaneous inclusion of multiple perspectives.
To compare the results of both algorithms, Figure 12 shows the ATS and APT values for each experiment. As can be observed, YOLO has slightly better accuracy and a much lower computation time, obtaining a much better ATS/APT result.
In addition, as expected, single-vehicle detection yields the best overall results, and adding real-world conditions to the problem produces more detection errors, reducing the ATS value; the most problematic element is the occlusion produced by terrain elements or multiple cars in the scenario. This means that, for real-world problems, it is essential to configure the mission so that it provides different perspectives in order to obtain useful results. The main advantage of the swarm is its capability to include multiple perspectives in the solution.
To evaluate the swarm effect, Figure 13 summarizes the metric results for one, three, and five UAVs. As can be seen for both algorithms, the average processing time remains fairly constant when adding more UAVs. The average tracking success improves as UAVs are added, although the results tend to plateau: there is more improvement between one and three UAVs than between three and five. This means that including more UAVs is not always the best solution, as poor perspectives can be introduced into the solution and degrade the average success value.
The swarm still provides better results than a single-UAV mission, but the configuration of the mission is more important than the inclusion of a large number of UAVs. To improve this solution, an advanced fusion system can be included to select among UAV detections, discarding local detections that provide wrong information.
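One simple form such a selection could take, sketched under the assumption that local detections are already projected into common ground coordinates, is a consensus filter that keeps only detections corroborated by a minimum number of UAVs; the radius and observer count below are assumed values.

```python
import numpy as np

CONSENSUS_RADIUS = 5.0  # assumed agreement radius in meters
MIN_OBSERVERS = 2       # a detection must be seen by at least two UAVs

def consensus_filter(detections_per_uav):
    """Discard local detections not corroborated by another UAV.

    detections_per_uav: list (one entry per UAV) of (M_i, 2) numpy arrays of
    ground-plane (x, y) detections. Returns the detections that at least
    MIN_OBSERVERS UAVs agree on; merging the surviving duplicates into a
    single global track would be the next step of the fusion system.
    """
    kept = []
    for i, dets in enumerate(detections_per_uav):
        others = [d for j, d in enumerate(detections_per_uav) if j != i and len(d)]
        for det in dets:
            # Count UAVs (including the observer) reporting a nearby detection.
            observers = 1 + sum(
                np.min(np.linalg.norm(o - det, axis=1)) <= CONSENSUS_RADIUS
                for o in others
            )
            if observers >= MIN_OBSERVERS:
                kept.append(det)
    return kept
```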
Finally, to compare all results, Table 7 shows the performance of the fast detection algorithms on a single-vehicle detection case against the fastest detection achieved with SAM, using the reduced-area image. As can be seen, SAM is more accurate, but with the use of swarms the faster algorithms achieve comparable results in a time much more suitable for the real-time surveillance problem.
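The reduced-area strategy mentioned above can be sketched as cropping a region of interest around the last known bounding box before invoking the segmentation model, so that SAM processes only a fraction of the frame; the margin value below is an assumption.

```python
def crop_roi(frame, last_box, margin=0.5):
    """Crop a region of interest around the last known bounding box so that
    the expensive segmentation model only processes a fraction of the image.

    frame: H x W x 3 image array; last_box: (x1, y1, x2, y2) in pixels;
    margin: fraction of the box size added on each side (assumed value).
    Returns the cropped patch and its offset, needed to map the
    segmentation results back to full-frame coordinates.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = last_box
    mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - mx), max(0, y1 - my)
    x2, y2 = min(w, x2 + mx), min(h, y2 + my)
    return frame[y1:y2, x1:x2], (x1, y1)
```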