Article

Infrared and Visible Camera Integration for Detection and Tracking of Small UAVs: Systematic Evaluation

1 Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
2 Department of Mechanical Engineering, University of Victoria, Victoria, BC V8P 5C2, Canada
3 IDMEC, Mechanical Engineering Institute, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
* Author to whom correspondence should be addressed.
Drones 2024, 8(11), 650; https://doi.org/10.3390/drones8110650
Submission received: 17 October 2024 / Revised: 3 November 2024 / Accepted: 4 November 2024 / Published: 6 November 2024
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)

Abstract

Given the recent proliferation of Unmanned Aerial Systems (UASs) and the consequent importance of counter-UASs, this project aims to perform the detection and tracking of small non-cooperative UASs using Electro-optical (EO) and Infrared (IR) sensors. Two data integration techniques, at the decision and pixel levels, are compared with the use of each sensor independently to evaluate the system robustness in different operational conditions. The data are submitted to a YOLOv7 detector merged with a ByteTrack tracker. For training and validation, additional efforts are made towards creating datasets of spatially and temporally aligned EO and IR annotated Unmanned Aerial Vehicle (UAV) frames and videos. These consist of the acquisition of real data captured from a workstation on the ground, followed by image calibration, image alignment, the application of bias-removal techniques, and data augmentation methods to artificially create images. The performance of the detector across datasets shows an average precision of 88.4%, recall of 85.4%, and mAP@0.5 of 88.5%. Tests conducted on the decision-level fusion architecture demonstrate notable gains in recall and precision, although at the expense of lower frame rates. Precision, recall, and frame rate are not improved by the pixel-level fusion design.

1. Introduction

Aside from the advantages an Unmanned Aerial System (UAS) offers in a variety of military and civilian applications, the number of recorded incidents of malicious activity caused by the lack of regulation, negligence, or criminal intent has been growing in recent years [1]. As a result, there has been a significant increase in efforts dedicated to the research and development of Counter-UAS (C-UAS) frameworks capable of countering threats posed by hostile UASs, and improving safety, security, and privacy.
The architecture of a C-UAS system can essentially be separated into two main components [2]. The perception of the Unmanned Aerial Vehicle (UAV) is first achieved through the sensing step, which uses one or more sensors to detect and distinguish it from other objects. This is followed by the classification step, which verifies the object and identifies it to ascertain whether it is malicious or illegal [3]. Then, localization in terms of the relative position and velocity is conducted, as well as the tracking of the target in order to follow its trajectory. The final step is neutralization, which involves the mitigation of the threat by interfering with or disabling it. Neutralizers can be categorized as electronic or kinetic–mechanical. Radio Frequency (RF) jamming, GNSS jamming, and spoofing are some examples of electronic neutralizers that use electronic waves to disrupt the performance of the target. Kinetic–mechanical techniques are based on the physical interception of the UAV through the use, for instance, of projectiles, collision UAVs, or nets [4].

Related Work

Several distinct approaches to UAV detection have been introduced over the years, as explored in [5]. The most frequently used sensors are RADAR, LiDAR, Electro-optical (EO) and Infrared (IR) imaging cameras, RF-based sensors, and acoustic sensors. Each of these has been widely utilized for object detection in different applications, though not necessarily for UAVs, given the constraint imposed by their small size. The main characteristics of each sensor, focused on the detection of small objects, are described as follows:
  • RADAR: RADAR transmits electromagnetic waves that are reflected by targets, with a frequency range from 3 MHz to 300 GHz. The interpretation of the reflected rays may determine the position and velocity of the objects. A significant issue with the use of RADAR for the detection of UAVs is their low RADAR cross-section area, which might render them undetectable. However, because RADAR presents a high robustness to weather and lighting conditions, research on micro-Doppler signature-based methods has been conducted [6]. In [7], the micro-Doppler effect for frequency-modulated continuous-wave (FMCW) RADAR applications is modelled, showing high confidence rates for UAV class identification based on the number of UAV motors. In [8], an X-band pulse-Doppler RADAR is used to compare the RADAR signatures of fixed-wing UAVs with only puller blades, multirotor UAVs with only lifting blades, and VTOL UAVs with both lifting and puller blades, which can help identify UAV types.
  • LiDAR: LiDAR shares the working principle of RADAR, although it operates at a higher frequency range, from 200 THz to 400 THz. It is also able to produce a 3D map of the environment. Although it is less robust to weather, it is still a valuable tool to initialize the position of the object in a detection system. In [9], a probabilistic analysis of the detection of small UAVs in various scenarios is proposed.
  • RF sensor: These passive sensors capture the signals used by a target to communicate with the ground, making it possible to detect, locate and also, in some cases, identify the aircraft. Apart from being robust to weather and lighting, one important feature of RF sensors is the possibility of detecting the controller on the ground, which is relevant in countering a threat. In [10], spectral–temporal localization and classification of the RF signals of UAVs with a visual object detector approach are performed. This shows promising results, even in noise interference situations, by processing the spectrograms using the YOLOv5 object detector. In [11], a novel RF signal image representation scheme that incorporates a convolutional neural network (CNN) is implemented to perform UAV classification, achieving high classification accuracy scores of 98.72% and 98.67% on two different datasets.
  • Acoustic sensor: An acoustic sensor can detect, distinguish, and identify the sound emitted by the engine and propellers of a UAV. By using a specific arrangement of multiple microphones, the estimation of the azimuth and elevation of one or more UAVs is possible. However, this sensor presents some limitations in terms of detection range, accuracy, and susceptibility to background noise interference, even though it is a low-cost and accessible tool. In [12], good performance results in the detection and localization of small UAVs are achieved by using an acoustic-based surveillance system. A UAV detection system based on acoustic signatures fed to two machine learning models, a Random Forest and a Multilayer Perceptron (MLP), is proposed in [13]. The MLP model was considered the better solution for the detection of complex and nonlinear acoustic features.
  • EO camera: An EO camera allows for the detection of objects by capturing the light reflected by them. Although intuitive to interpret and capable of providing detailed information on the surrounding environment, these sensors present low robustness to poorly lit scenes, namely at night, and to weather conditions such as rain and fog. For visual object detection, different computer vision algorithms based on Deep Learning (DL) models have been developed. A comparison between 14 object detectors on the proposed visual UAV dataset is conducted in [14], drawing conclusions based on performance and processing time. In [15], a detection and tracking system for small UAVs using a DL framework that performs image alignment is proposed. Results with high-resolution images show a track probability of more than 95% up to 700 m. The problem of distinguishing small UAVs from birds is addressed in [16]. It concludes that object detectors benefit from being trained with datasets that include various UAVs and birds in order to decrease the number of False Positive (FP) detections at inference.
  • IR sensor: IR sensors, which capture the thermal signature of objects, are widely explored in the military sector and are better suited to some challenging scenarios. Particularly using Long-Wave Infrared (LWIR) sensors, the thermal signatures emitted by the batteries of UAVs can be detected. These sensors typically have a lower imaging resolution and higher granularity, which limit their use independently. These issues, which result in a lack of texture and feature highlights, are addressed in [17]. An improved detector that drops low-resolution layers and enhances high-resolution layers is proposed. It also includes a multi-frame filtering stage consisting of an adaptive pipeline filter (APF) to reduce the FP rate, achieving a precision of more than 95%. Promising results in small-UAV detection using an IR sensor are achieved in [18], by learning the nonlinear mapping from an input image to the residual image and highlighting the target UAV by subtracting these images.
In particular for vision-based object detection, which is one of the most relevant tasks of computer vision, DL methods are commonly exploited [19]. The object detectors take an image as input and output the bounding boxes of the detected objects with the corresponding labels. In turn, multi-object trackers provide temporal continuity by associating sets of detections across frames, assigning an ID to each object without any prior knowledge of its location.
To tackle different problems or to enhance performance and results, the use of more than one sensor is advantageous. For imaging sensors, decision-level and pixel-level data fusion approaches have shown promising results for various applications. In [20], a decision-level approach that bases the detection on the sensor with the highest confidence score at the output stage is developed. In [21], context-aware fusion at the pixel level is performed after image segmentation for traffic surveillance applications. In this case, a merged image created from the output of both cameras is submitted to detection. One of the most relevant challenges of this approach is the requirement for spatially and temporally aligned data.
Publicly available datasets of spatially and temporally aligned EO and IR images for the detection of UAVs are scarce. In [22], three methods to obtain paired EO and IR images for the detection of cars and people are studied. These include a Generative Adversarial Network (GAN) algorithm to generate images, a simulation environment, and a combination of both. The results show a poor performance of the detector when trained with the synthetic datasets and tested with real data. Even so, traditional data augmentation techniques such as image manipulation, image erasing, and image mixing can benefit a dataset and improve the detection results.
Small-object vision-based detection remains a challenging topic, despite extensive progress over the years to improve both the processing time and accuracy of these systems. Comparative research into sensors and sensor integration techniques is important.
The present work aims to perform the detection and tracking of small non-cooperative UAVs using EO and IR imaging sensors. It makes a comparison between the use of each sensor independently and two sensor fusion algorithms, at the decision and pixel levels. Since the main objective is to use real data to evaluate the performance of the architectures in different conditions and scenarios, a further goal is the construction of the necessary spatially and temporally aligned EO and IR datasets of UAVs. This includes the creation of artificial data and the acquisition of real data during flight experiments (FEs) performed at the University of Victoria’s Center for Aerospace Research (UVIC-CfAR). By conducting extensive robustness tests and validating the system using real flight data, conclusions are drawn on the most appropriate method in terms of performance and real-time capability for the cases explored.

2. Detection and Tracking Architecture

2.1. Proposed System Architecture

The system architecture depends on the data fusion methodology, which can take place at the output or input level from the point of view of the detector.
The decision-level data fusion case is a late-fusion method, in which a decision is made by combining the outputs of different algorithms previously processed. In this case, firstly, both images are fed to separate detectors previously trained for the corresponding image types. Then, the algorithm takes both output detections and computes the final result before feeding it to the tracker. This process for a single frame is shown in Figure 1.
For the pixel-level data fusion, the raw pixels from multiple sources are combined, so the fusion occurs on a pixel basis. This is an early-fusion method because it occurs before image classification. In this case, in the first step, both images are merged into a single one that preserves the relevant features of each, and then the resulting image is submitted to a single detector and tracker. The steps of this method are depicted in Figure 2.
In this project, the object detector selected was YOLOv7, which uses a deep CNN to identify objects in images [23]. It belongs to the You Only Look Once (YOLO) family of real-time object detectors, representing its state of the art at the time of selection. This choice was based on the improvements in accuracy and, especially, processing time achieved by the authors of YOLOv7. Even though more recent versions are now available, little literature on them had been produced at the time of this selection, so YOLOv7, which had been extensively reported on, was selected. Despite the development and progress of object detectors in recent years, there are still relevant challenges that were considered in this work. These include intra-class variation, where detectors may fail to detect objects of the same class not represented in the dataset, and inter-class variation, where object detectors may fail to distinguish different classes. Hardware requirements are also a relevant limitation since computer vision algorithms often have a high demand for memory and lead to intensive training sessions.
As for the tracker, the state-of-the-art ByteTrack was selected [24]. ByteTrack is a tracking-by-detection tracker that uses Intersection over Union (IoU) to associate detections provided by the object detector with the tracks stored in memory. Based on the detection results, the tracker creates an ID for each object and follows its trajectory in consecutive frames, keeping the same ID, or creating a new instance if a different object is detected. It also uses a Kalman filter to predict the position of the objects in the current frame, given their location in previous frames. As opposed to most trackers, ByteTrack keeps all the detections provided by the associated detector, including low-confidence ones, to increase robustness to occlusion, motion blur, and varying bounding box sizes. In this project, the tracking task performed refers to following the trajectory of a UAV in videos, not to following its trajectory during flight by having the sensors move autonomously.
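To make the tracking-by-detection principle concrete, the sketch below implements a heavily simplified greedy IoU association between consecutive frames. It is not ByteTrack itself (which adds a Kalman filter, track buffering, and a second association pass over low-confidence detections); the input boxes are assumed to come from the YOLOv7 detector, and the IoU threshold is illustrative.

```python
# Minimal tracking-by-detection sketch: greedy IoU association of detections with
# the tracks kept from the previous frame. Simplified for illustration only.
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class SimpleTracker:
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}        # track ID -> last known box
        self.next_id = 0

    def update(self, boxes):
        assigned = {}
        unmatched = list(self.tracks.items())
        for box in boxes:
            # greedily match each detection to the stored track with the highest IoU
            best_id, best_iou = None, self.iou_threshold
            for tid, tbox in unmatched:
                score = iou(box, tbox)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:                      # no overlap: start a new track
                best_id, self.next_id = self.next_id, self.next_id + 1
            else:
                unmatched = [(t, b) for t, b in unmatched if t != best_id]
            assigned[best_id] = box
        self.tracks = assigned                       # tracks not re-detected are dropped
        return assigned

# Example: a slightly drifting box keeps the same ID across frames.
tracker = SimpleTracker()
print(tracker.update([np.array([100, 100, 140, 130])]))   # assigned ID 0
print(tracker.update([np.array([104, 102, 144, 132])]))   # still ID 0
```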
Image alignment in consecutive frames was additionally considered for the system to compensate for camera motion, before submission to the tracker. Here, Enhanced Correlation Coefficient (ECC) maximization was chosen to estimate the parameters of the motion models for the system [25]. This gradient-based iterative method is robust against geometric and photometric distortions. Previous work developed at UVIC-CfAR showed tracking improvements when this algorithm was applied [26].
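As an illustration, the sketch below uses OpenCV's ECC implementation to estimate a frame-to-frame translation and warp the current frame into the previous frame's reference; the termination criteria and Gaussian filter size are illustrative values, not the settings used in this work.

```python
# Camera-motion compensation between consecutive greyscale frames using ECC
# maximization with a translation motion model (illustrative parameter values).
import cv2
import numpy as np

def align_to_previous(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Estimate the translation mapping curr_gray onto prev_gray and warp it back."""
    warp = np.eye(2, 3, dtype=np.float32)                 # 2x3 matrix, translation only
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    _, warp = cv2.findTransformECC(prev_gray, curr_gray, warp,
                                   cv2.MOTION_TRANSLATION, criteria, None, 5)
    h, w = prev_gray.shape
    return cv2.warpAffine(curr_gray, warp, (w, h),
                          flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
```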

2.2. Data Fusion Methodology

The decision-level data fusion algorithm used in this project was developed with the main goal of improving the detection results with minimal impact on the processing time. For this reason, it considers one data type as the main one, thus favouring one of the cameras, and uses the other only as confirmation data accessed in certain scenarios. According to this principle, the frames produced by the head sensor, which can be either camera, are always fed to the corresponding head detector and its predictions processed. Based on the characteristics of these detections and the need to complement the information gathered, the confirmation sensor may be used. The criteria used to determine the need for the confirmation model in each frame are mainly based on two factors related to the predictions of the head detector: the number of predictions per frame, which aims to eliminate FP or False Negative (FN) detections, and the associated confidence scores. When the confirmation model needs to be accessed, the system has two prediction sets, the head and the confirmation predictions, and a balance between them is calculated. The first step is to match the detection pairs between both image types, if they exist, that is, to determine whether the detections from the head and confirmation models correspond to the same object or to different instances. Once the matches are computed, the confidence scores of the detections are compared to two previously established thresholds, the primary and the secondary thresholds. A detection is considered valid if the object is detected in only one image type with a confidence score higher than the primary threshold; if no score exceeds the primary threshold, the object must be detected by both models with confidence scores higher than the secondary threshold to be valid. In this project, both the EO and the IR sensors were tested in both roles. This approach to the fusion has the advantage of saving detection processing time since there is a reduced number of instances in which both models are loaded. This algorithm requires that the videos from both sensors have the same frame rate.
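The decision rule can be summarized by the hedged sketch below. The threshold values, the criterion for invoking the confirmation model, and the `match` predicate (which in this work relies on the spatial alignment of the EO and IR frames) are illustrative placeholders rather than the exact implementation.

```python
# Sketch of the decision-level fusion rule: a detection is valid if a single sensor
# reports it above the primary threshold, or if both sensors report it above the
# secondary threshold. Thresholds and helper criteria are illustrative.
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]     # x1, y1, x2, y2
Detection = Tuple[Box, float]               # bounding box, confidence score

T_PRIMARY = 0.6      # illustrative values
T_SECONDARY = 0.35

def needs_confirmation(head_dets: List[Detection]) -> bool:
    # e.g. no detections at all, or none above the primary threshold
    return len(head_dets) == 0 or all(conf < T_PRIMARY for _, conf in head_dets)

def fuse(head_dets: List[Detection], conf_dets: List[Detection],
         match: Callable[[Detection, Detection], bool]) -> List[Detection]:
    fused = []
    for box, conf in head_dets:
        if conf >= T_PRIMARY:
            fused.append((box, conf))                        # valid on its own
            continue
        pair = next(((b, c) for b, c in conf_dets if match((box, conf), (b, c))), None)
        if pair is not None and conf >= T_SECONDARY and pair[1] >= T_SECONDARY:
            fused.append((box, max(conf, pair[1])))          # confirmed by both sensors
    for box, conf in conf_dets:                              # confirmation-only detections
        if conf >= T_PRIMARY and not any(match((box, conf), d) for d in head_dets):
            fused.append((box, conf))
    return fused
```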
For pixel-level data fusion, extensive research on algorithms developed for the fusion of EO and IR images, though not specific to the UAV case, was conducted in [27]. That study compares the algorithms both in terms of performance, through the analysis of the resulting fused images on selected metrics, and in terms of processing time. In this project, the requirement for an adequate fusion time per image was considered more important, so the algorithm selected, which achieved good performance in the mentioned analysis, was FusionGAN [28]. It aims to produce enhanced greyscale images by taking advantage of the thermal information while maintaining visible textures. Here, the GAN generator is trained to create images that take into consideration the thermal intensities of the IR image along with the image gradients present in the EO images. The GAN discriminator is trained to maximize the presence of the features of the EO image in the fused image. The final resulting image is selected when the discriminator can no longer distinguish the images produced by the generator from real ones. An example of the application of FusionGAN is shown in Figure 3. One advantage of this algorithm is that it can fuse images that do not have the same resolution by up-sampling the low-resolution image. This is important because the available EO cameras generally have much higher resolutions than IR sensors. A relevant limitation of this algorithm is the high amount of GPU memory it requires to process the images and produce the fusion.
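For illustration, the generator objective described in the FusionGAN paper combines an intensity term that pulls the fused image towards the IR intensities with a gradient term that pulls it towards the EO gradients; the PyTorch sketch below conveys that idea with an illustrative gradient operator and weighting factor, not the exact published formulation.

```python
# Sketch of a FusionGAN-style content loss for the generator: stay close to the IR
# intensities while matching the gradients of the EO image. Inputs are (B, 1, H, W)
# tensors; the gradient operator and weight xi are illustrative choices.
import torch
import torch.nn.functional as F

def image_gradients(img: torch.Tensor) -> torch.Tensor:
    """Laplacian-style gradient response of a single-channel image batch."""
    kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                          device=img.device).view(1, 1, 3, 3)
    return F.conv2d(img, kernel, padding=1)

def fusion_content_loss(fused, ir, eo, xi: float = 5.0) -> torch.Tensor:
    intensity_term = torch.mean((fused - ir) ** 2)       # keep thermal intensities
    gradient_term = torch.mean((image_gradients(fused) - image_gradients(eo)) ** 2)
    return intensity_term + xi * gradient_term

# The full generator loss adds an adversarial term from the discriminator, which is
# trained to tell fused images apart from real EO images.
```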

3. Dataset

Two datasets were created due to the requirements imposed by the data fusion methodologies selected, using data collected during flight experiments: the labelled dataset and the inference dataset. The labelled dataset was used to train object detectors, and included spatially and temporally aligned variations of EO, IR, and Pixel Fused real and artificial images. Each of these sub-datasets consisted of a total of 5977 labelled images of UAVs. In turn, the inference dataset consisted of EO, IR, and Pixel Fused real videos that were spatially and temporally aligned but not labelled, that totalled 35,907 frames.

3.1. Experimental Work

The main components of the video capture system used at UVIC-CfAR were the cameras, the analogue-to-digital signal converters, and the software used. Figure 4 shows the experimental setup, where the sensors are highlighted in red.
The EO camera sensor was a SONY FCB-EX1020 PAL and the IR camera sensor was a FLIR TAU 640 PAL. These sensors were integrated in a TASE 200 gimbal, so there was a fixed displacement between them. This displacement was measured to be 50 mm ± 0.5 mm. The camera parameters were controlled using ViewPoint software. Each sensor was integrated with a low-latency video encoder, Antrica ANT-1772, that can stream in both RTSP and MPEG TS formats over an Ethernet connection. The software Neptune Guard was used to configure each encoder. The streams were displayed and recorded with Neptune Player that has a very-low-latency viewer. The OBS program was also used simultaneously for recording and live streaming the same videos.
The equipment was on a workstation on the ground capturing the aircraft in the air and not mounted onboard an aircraft. The operator manually moved the gimbal to include the aircraft in its field of view, and no autonomous gimbal movement was used. This means that the cameras were fixed with respect to each other, but the gimbal was moving to record the aircraft. Both sensors were always set to start recording at the same time to contribute to the temporal alignment of the frames.
In case this system is mounted onboard an aircraft in future work, the main equipment change that needs to be made is replacing the computing unit with an embedded system. The weight of the whole system to be deployed must also be taken into consideration. The selection of the detector aircraft will depend on this payload weight, which would include an embedded system, the TASE 200 gimbal, the two video encoders, batteries to power these components, digital datalinks for communication with the ground station, and the necessary cables.
First, sensor calibration was performed for the two sensors independently to determine their geometric parameters since image alignment is essential to guarantee the success of the FusionGAN algorithm. The MATLAB simple camera calibration app was used for the calibration. Based on the Pinhole Perspective camera model, the intrinsic, distortion, and extrinsic parameters of each sensor were estimated [29]. The calibration results are presented in Table 1, where the most significant difference is observed in the distortion coefficients, namely for radial distortion. In fact, the distortion that the IR sensor causes in the images is noticeable to the naked eye in some cases. However, this is found to have a negligible impact on the alignment of the UAVs at longer ranges.
One aspect to consider was that the IR sensor could not capture a defined image at a close range. For this reason, a calibration board printed in the standard A4 or A3 sizes would be too small to be captured, appearing blurry through the IR sensor, which would make the calibration process impossible. Instead, a 10 × 7 calibration board with 15 cm × 15 cm squares, totalling 150 cm × 105 cm, was built to capture the images, as can be seen in Figure 5.
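For reference, the MATLAB app was used for the calibration in this work; a roughly equivalent OpenCV workflow for the custom board (10 × 7 squares of 15 cm, i.e. 9 × 6 inner corners) is sketched below, with placeholder file paths.

```python
# Illustrative OpenCV calibration equivalent to the procedure described above.
# Estimates the intrinsic matrix K, distortion coefficients, and per-image extrinsics.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)           # inner corners = squares - 1 per side for a 10 x 7 board
SQUARE_SIZE = 0.15         # metres

# 3D corner coordinates on the board plane (z = 0)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points, image_size = [], [], None
for path in glob.glob("calibration_frames/*.png"):     # placeholder path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)
        image_size = gray.shape[::-1]

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points,
                                                 image_size, None, None)
print("RMS reprojection error:", ret)
```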
The goal of the most relevant flight experiments was to collect data containing as much variety in operational conditions as possible, in the form of mp4 videos recorded at 25 fps, to integrate into the dataset used to train the object detector models and test the system.
Flight Experiment A was an operation at UVIC-CfAR with the aircraft Mini-E [30], illustrated in Figure 6a. This aircraft was flying in circles passing through the waypoints shown in Figure 7a. Additional flights of a DJI Mavic 2, displayed in Figure 6b, were performed on the same day to guarantee more variety with a more common aircraft. This UAV was flying along straight lines, as can be seen in Figure 7b. The total recorded flight time was 51 min 21 s for the Mini-E and 22 min 28 s for the DJI Mavic 2. During the postprocessing stage of the raw videos, it was concluded that the EO and IR frames were not always fully aligned. This happened mostly when one of the sensors of the gimbal automatically adjusted a camera parameter, creating a lag in the transmission of the videos.
Flight Experiment B was also an operation at UVIC-CfAR with the main goal of gathering data on the hybrid multirotor [31], shown in Figure 6c. The flights captured include vertical take-off, hovering, and landing. Since this flight experiment was performed inside a gymnasium and hence the background is similar in all frames, only a total of 2 min 10 s was recorded.
Flight Experiment C was conducted at UVIC-CfAR with the main goal of gathering footage to include in the dataset used to test the system at inference. This included both the same DJI Mavic 2 from Flight Experiment A, shown in Figure 6b, to assess the performance of the system with the same aircraft under different conditions, and a DJI Inspire 1, seen in Figure 6d, to test the robustness of the system to intra-class variation. The flight paths chosen for this experiment were the same for both UAVs, following straight lines as depicted in Figure 7c. The DJI Mavic 2 and DJI Inspire 1 were recorded for a total of 19 min 18 s and 24 min 35 s, respectively.
Additional data were collected on Flight Experiment D conducted at Instituto Superior Técnico (IST) with a TeAx ThermalCapture Fusion Zoom. The frames provided include a Zeta FX-61 Phantom Wing and a DJI Mini 3 Pro, as can be seen in Figure 6e,f, respectively.
In general, the data include frames with the UAV blurred or partially cut, the presence of birds in some frames, frames above and below the local horizon, and a background with variety in objects, especially trees, houses, and farming tools. There is also variety in the range of the UAV and its position in the frames. In terms of lighting, variety includes bright images taken during summer days, indoor images with artificial lighting, and images taken at twilight in autumn.

3.2. Labelled Dataset

The real data in the labelled dataset are the data collected during Flight Experiments A, B, and D.
As mentioned, the data gathered during Flight Experiments A and B were not completely aligned, so postprocessing included calibration, spatial and temporal alignment, and the application of a method for bias removal.
The image calibration eased the spatial alignment of the images by effectively adjusting several parameters, including tangential and radial distortions. Nevertheless, aligning two images captured by distinct sensors is a challenging task. For this reason, the image alignment was a supervised procedure, mainly for three reasons. First, some video frames were not perfectly aligned in time. Secondly, the IR sensor has a much slower shutter speed than the EO sensor, which often led to motion blur at higher UAV velocities. Since altering shutter speeds was not an option in the software used, a decision was made to include this type of pair in the dataset. Lastly, the parallax effect results from the fixed displacement of the camera lenses mounted in the gimbal. This effect was more noticeable when the UAV was near the sensors, and negligible at long range, having no effect on alignment.
For Flight Experiments A and B, further work was carried out to eliminate the dark gradients that the IR sensor created, which are especially pronounced in the corners of the images. Computer vision algorithms may be significantly impacted by this kind of effect. First, this was considered a vignette effect, that is, a brightness attenuation away from the image center, and treated as such, using a method to estimate it from a single image [32]. This approach did not yield satisfactory vignette function estimations for all the images. For this reason, it was then considered an intensity nonuniformity, that is, a bias that can be caused by illumination changes, thus treating the perturbation as a variation in intensity that does not follow a specific distribution [33]. With this strategy, the results for gradient estimation and hence bias removal from the images were acceptable. One example result of this procedure is shown in Figure 8. The usage of such an approach can generate more noise, so, for research purposes, the IR dataset was duplicated and the bias-removal algorithm applied to the copy. A Pixel Fused dataset created with the FusionGAN algorithm, using as inputs the EO and the bias-removed IR images, was also built. The goal was to evaluate the effect of the image correction by comparing the performance of the object detectors on the original images and on the bias-corrected images.
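Purely to illustrate the idea of estimating and subtracting a smooth intensity gradient, the sketch below treats a heavily blurred copy of the IR frame as the bias field; it is not the method of [32] or [33], and the smoothing scale is an arbitrary choice.

```python
# Generic low-frequency bias removal: estimate the smooth gradient with a large
# Gaussian blur and subtract it while preserving the mean brightness.
import cv2
import numpy as np

def remove_smooth_bias(ir_frame: np.ndarray, sigma: float = 51.0) -> np.ndarray:
    img = ir_frame.astype(np.float32)
    bias = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)    # low-frequency estimate
    corrected = img - bias + bias.mean()                  # keep overall brightness
    return np.clip(corrected, 0, 255).astype(np.uint8)
```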
For the Flight Experiment D data, the software used for data capture performed image alignment, so the frame selection was solely supervised to guarantee the elimination of outliers.
For the real data captured, the UAV had to be labelled by outlining its bounding box, assigning it a class, and producing a .txt file in the YOLO format. The accuracy of the labels associated with each object has a significant impact on the performance of an object detector. Two main labelling strategies were considered. First, a DL-assisted methodology could be implemented by sending the data to another object detector trained for the same purpose; the label files could then be created from the output detections of this extra detector. However, this method has an additional source of error, since it depends on the precision, recall, and accuracy of that detector, which might significantly impact the dataset. For this reason, although more time-efficient, this method should be supervised and its results verified to guarantee a good outcome. Secondly, the dataset could be manually labelled by outlining the UAV, if present, and creating the .txt file. Even though the latter is especially time-consuming for large datasets, a decision was made to manually label all the real data featured in the labelled dataset (5977 images) to avoid the errors a DL-assisted method can introduce.
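For reference, each YOLO-format .txt file contains one line per object, with the class index followed by the box centre and size normalized by the image dimensions; the small helper below shows the conversion from a pixel-space box.

```python
# YOLO label format: "class x_center y_center width height", all normalized to [0, 1].
def to_yolo_line(cls_id, x1, y1, x2, y2, img_w, img_h):
    xc = (x1 + x2) / 2.0 / img_w
    yc = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# e.g. a UAV (class 0) spanning pixels (300, 200)-(340, 225) in a 640 x 512 frame:
print(to_yolo_line(0, 300, 200, 340, 225, 640, 512))
# -> "0 0.500000 0.415039 0.062500 0.048828"
```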
Finally, a data augmentation strategy was used to create artificial images, as sketched after this paragraph. Figure 9 shows three examples of the final produced pairs of images of UAVs. The method consisted of first placing background-transparent images of UAVs onto spatially and temporally aligned background images, and then applying random transformations, such as brightness changes, to increase the variety of the dataset. This approach is especially suitable in this particular scenario because there are few restrictions on the position of a UAV within an image. For instance, this method would not be effective for a dataset of railed vehicles, which have to be placed on rails for an image to be plausible. The algorithm started by making a random choice of an image pair for the background and of a UAV image. Due to the lack of publicly available datasets of paired UAV images, the algorithm took as input only an EO image of the UAV. The corresponding IR image was created by applying a transformation to the EO image, giving it a random greyscale intensity within a range of values and applying a random level of blurriness to its outline. This approach was selected after tests showed that it produced IR images of the UAV most similar to those from the IR sensor used in the flight experiments. The algorithm then randomly chose the size of the UAV in the frame, followed by the random selection of its position in the image. It also incorporated options to rotate the UAV and to change the brightness, contrast, and blurriness level of the produced image. The background images used were obtained with the TASE 200 both during Flight Experiment A and during extra experiments at UVIC-CfAR. As for the UAV images, eight different models, including quadcopters, a hexacopter, and a fixed wing, were used. This process has the advantage of automatically producing the labels in the YOLO format.
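A condensed sketch of this augmentation step is given below: a transparent UAV cut-out is pasted at the same position onto an aligned EO/IR background pair, with the IR appearance derived from the cut-out's alpha mask via a random grey intensity and blur. All parameter ranges are illustrative.

```python
# Augmentation sketch: composite a UAV cut-out onto aligned EO/IR backgrounds and
# synthesise its IR appearance (random grey level, blurred outline). Illustrative only.
import random
from PIL import Image, ImageFilter

def synthesise_pair(eo_bg: Image.Image, ir_bg: Image.Image, uav_rgba: Image.Image):
    scale = random.uniform(0.02, 0.15)                    # UAV width relative to frame
    w = max(4, int(eo_bg.width * scale))
    uav = uav_rgba.resize((w, max(2, int(w * uav_rgba.height / uav_rgba.width))))
    uav = uav.rotate(random.uniform(-20, 20), expand=True)

    # pseudo-IR version: flat random grey intensity inside the alpha mask, blurred edge
    grey = random.randint(140, 255)
    ir_uav = Image.new("L", uav.size, grey).convert("RGBA")
    ir_uav.putalpha(uav.split()[-1].filter(ImageFilter.GaussianBlur(random.uniform(0.5, 2))))

    x = random.randint(0, eo_bg.width - uav.width)
    y = random.randint(0, eo_bg.height - uav.height)
    eo_out, ir_out = eo_bg.copy(), ir_bg.copy()
    eo_out.paste(uav, (x, y), uav)                        # same position in both images
    ir_out.paste(ir_uav, (x, y), ir_uav)
    # the YOLO label (class 0) follows directly from (x, y, uav.width, uav.height)
    return eo_out, ir_out, (x, y, uav.width, uav.height)
```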
To sum up, the labelled dataset comprised a total of five variations of spatially and temporally aligned images: EO, IR, IR with bias removed, Pixel Fused, and Pixel Fused with bias removed. Figure 10 shows three frame examples of the dataset. The image size varies by a few pixels due to the image alignment but is approximately 640 × 512 for all images. Of the total, some images contain no UAVs, and the remaining ones feature a total of 11 different aircraft. About 20% of the images are artificially created. Keeping the datasets identical apart from the image type or data fusion methodology makes the comparison of the architectures more legitimate.
In this project, since the dataset was relatively small and as much data as possible was needed for training, an 80-10-10 partition into training, validation, and test sets was used.

3.3. Inference Dataset

The videos used to evaluate all of the different systems at inference were the videos from Flight Experiment A that were not used to build the labelled dataset and the videos from Flight Experiment C. The videos were separated into segments of interest to discard the asynchronous video segments, avoid testing on videos with sudden camera movements caused by the operator, and isolate the desired testing features and conditions. These include separating the segments by UAV range, based on the number of pixels the UAV occupied in the images, and by type of background, depending on whether the UAV was above or below the local horizon. The cases with a blurry UAV and a partially cut UAV were also isolated. Finally, to assess the system robustness to intra-class variation and inter-class variation, segments featuring an aircraft that was not used to train the detectors and segments with the presence of birds, respectively, were isolated.

4. Results and Discussion

This section describes the detection and tracking results obtained by the system. Firstly, the training and testing processes of the YOLOv7 detector for the five different variations of the labelled dataset are explained, and a comparison with the YOLOv7-tiny model is conducted. Secondly, the models are tested independently on the inference dataset to evaluate and benchmark their performance. Thirdly, both data fusion architectures are tested on the inference dataset, and the average results are presented. These sections also include the isolated testing and comparison of the systems in specific challenging cases for UAV detection.

4.1. Detector on Labelled Dataset

The five different labelled datasets were used in YOLOv7 model training sessions using the same parameters for all, for comparison reasons. Three different GPUs were used: the NVIDIA GeForce RTX 4080, NVIDIA GeForce RTX 2070 SUPER, and the NVIDIA GeForce RTX 3050.
The resulting models were evaluated using the test set images. The results are presented for precision, recall, Mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), and Mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05 (mAP@[.5:.95]). Table 2 shows the YOLOv7 results for 500 epochs, where BR stands for Bias Removed.
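For reference, assuming the standard single-class definitions, the reported metrics can be written as follows (TP, FP, and FN denote true positives, false positives, and false negatives at a given IoU threshold):

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
\]
% mAP@0.5 is the average precision computed at IoU = 0.5 (averaged over classes when
% more than one class is present), and mAP@[.5:.95] averages it over ten thresholds:
\[
\mathrm{mAP@[.5{:}.95]} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \dots,\, 0.95\}} \mathrm{mAP@}t
\]
```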
It is possible to conclude that all models had a similar performance on the respective test sets. The lower values for mAP@[.5:.95] mean that the bounding boxes outlined by the detectors were not always exactly placed, even if the detection of the UAV was correct.
As for the differences between datasets, both IR models and both Pixel Fused models always outperformed the EO model, although the improvement was not significant. One possible explanation is the extra information an EO model has to learn, since it involves colours, while the other models deal only with intensities. Additionally, one case present in this dataset that benefited the IR and Pixel Fused models was the UAV flying below the local horizon, and thus against a textured background. Although not abundant, this case may have led to FNs for the EO detector, especially if the UAV had colours similar to the scene.
Finally, it is relevant to compare the IR and Pixel Fused models with the corresponding models with bias removed. In general, the metrics showed a better performance for the original models. Firstly, the bias-removal algorithm might have introduced noise in the images, deteriorating the results. Secondly, the bias estimation and then removal procedure might have removed pixel intensity from the UAV, reducing its highlight and thus disturbing the detection process. Finally, it is possible that the bias inherent in the original images had no negative effect on the results. This could be because the gradients were not fixed for all the images, that is, the bias mask was not constant, and due to the fact that the dataset included variety in the position of the UAV in the frame. Since no significant improvements were observed, further tests consider the models without bias removal.
By examining the output images that are part of the test set, it was possible to isolate the operational conditions that contributed to an increase in FPs and FNs, leading to a decrease in precision and recall, respectively. For precision, the models often mistook birds for UAVs, besides producing some FPs in the presence of background objects such as houses. For recall, the majority of the FNs occurred when the UAV was flying below the local horizon with a textured background. The IR sensor presented higher robustness to this scenario. Apart from these impacts, there were also other conditions of particular interest in the context of this project, such as when the UAV appeared blurry or partially cut in the image, and intra-class variation. The latter can be tested with the data from Flight Experiment C, which include an aircraft not featured in the training dataset.
Additional training sessions were performed using the YOLOv7-tiny model. This is similar to the YOLOv7 model, but this configuration has a reduced number of parameters, thus using less GPU memory, which makes it faster and less resource-intensive. The tests and results using the YOLOv7-tiny model are relevant in case the system is implemented onboard an aircraft. In this scenario, the computing system needs to be changed to an embedded system, which limits the frame rate since its parallel processing ability is significantly constrained.
Table 3 shows precision, recall, and mAP results as presented for the YOLOv7 model. In this case, an additional column with the percentage of decrease in the processing time of the YOLOv7-tiny model, when compared to the YOLOv7 model, is presented.
When compared to the results from Table 2, accuracy decreases when using the YOLOv7-tiny model, although not significantly. In terms of processing time, however, the decrease obtained averages 56.8% across datasets. These results are relevant for a real-time implementation of the system onboard an aircraft, which requires the models to run on an embedded system.
Since further testing in this project was performed offline using an NVIDIA GeForce RTX 4080, which is a relatively fast GPU, it was decided to continue with the YOLOv7 models, thus favouring performance metrics instead of the frame rate.

4.2. Overfitting

Overfitting is a concern in any detection system. In particular, for this project, overfitting was carefully examined since the detector was trained on an original and relatively small dataset, and some precautions were taken to prevent it. Firstly, the validation set was used to perform cross-validation. This process consisted of the constant evaluation of the models on the validation set during the training process in order to assess their capacity to generalize to different data. Secondly, data augmentation was used, mainly by activating the YOLOv7 built-in augmentation option. This includes the application of methods such as translation, cropping, noise, brightness, contrast, saturation, and Gaussian blur to the images during the training stage. Finally, there was strict control over the overall number of training epochs used for each model, and the training sessions were stopped when no significant improvements were observed. Since, as shown in Table 2 for a training session common to all five datasets, the models presented good results, showing the ability to make accurate predictions for the validation data and for the new data in the test set, it was concluded that they were not overfitting. To further confirm this for the EO, IR, and Pixel Fused models, independent inference tests were conducted using video segments with more variety in certain conditions; these supported the same conclusion, even though the models were not always robust to all conditions and variables. For the goal of this study, which is a proof of concept for data fusion techniques, the models in their present condition were considered adequate and sufficient.

4.3. Independent Model Testing

Independent model testing refers to the testing of the system on the inference dataset using each one of the sensors separately and without any data fusion methodology. The aim is to benchmark the performance of the detector alone, the detector plus tracker, and the ECC algorithm. The results obtained using the NVIDIA GeForce RTX 4080 GPU are presented separately for Flight Experiments A and C. This is because the labelled dataset includes frames taken from other videos recorded during Flight Experiment A, sharing similar conditions in terms of lighting and background. In this way, the segments from Flight Experiment A are used for the comparison of the data fusion architectures and the effectiveness of the algorithms, and the segments from Flight Experiment C are used to test system robustness to conditions not featured in the training data.
Table 4 shows the average detection and tracking results of precision, recall, frame rate, and the average number of Identification Switches (IDSs) for every 100 frames of the tracker.
It is observable that the results for Flight Experiment A are generally better in terms of recall, while the frequent presence of birds leads to FPs and thus lower precision. In turn, as expected, the results on the video segments from Flight Experiment C are better for precision, while the variety in conditions and UAV models not included in the dataset leads to FNs and lower recall. In fact, the predominance of video segments with the UAV below the local horizon is purposely much higher for Flight Experiment C, since it is one of the target conditions to be analysed with the data fusion methods. One other factor to consider is the lighting condition of the environment, which was recorded predominantly around twilight, when the sky is exceptionally bright. This caused the regions below the local horizon to appear very dark in contrast with the sky, so the UAV was often undetectable, even to the naked eye, as depicted in Figure 11. This affected the images provided by both sensors. When comparing the performance of the sensors separately, it is possible to conclude that, for Flight Experiment A, precision, recall, and frame rate are generally better for the EO sensor. For Flight Experiment C, these results are more variable depending on the range. In terms of range, the performance is better for the medium and far ranges than for the close range. In particular, for Flight Experiment C, the recall obtained using the IR sensor at close range is lower because in most frames the UAV is closer than the focal distance of the sensor, hence appearing very blurry in the images. For the very far range in Flight Experiment C, both models underperform, which leads to the conclusion that this range limits the system.
As for the average number of IDSs per 100 frames, it is generally lower for the EO sensor than for the IR sensor. Visual analysis of the output videos with IDs led to the conclusion that there are mainly three reasons for the missed tracks. Firstly, there is the case when the UAV is not detected at all, leading to FNs. Here, if this happens in consecutive frames and the UAV is dropped from the list of tracks, the tracker assigns it a new ID. Secondly, the tracker incorrectly assigns an ID to an object when it is detected in a series of successive frames, such as when there are FPs on birds that follow a trajectory. Finally, abrupt camera movements can negatively impact the tracker performance, even when subtle. This effect is more significant at longer ranges because the predicted bounding box is smaller, so camera movements are more likely to result in a lack of overlap between consecutive bounding boxes.
As for the frame rate, the system is able to process between 97.6 and 117.0 frames per second. These values are relative and highly dependent on the hardware used.

4.3.1. ECC Algorithm

The inclusion of the ECC algorithm in the system, considering translation as the motion model, to attempt to compensate for camera motion, showed a dependency on feature-rich, textured backgrounds for both image types. The transformation of the images and bounding boxes to the reference frame was successful when the UAV was below the local horizon, with more details and an abundance of motionless objects, or at close range. At closer range, however, the effect of camera movement was less noticeable due to the larger bounding box size, which led to a higher likelihood of overlap between consecutive frames. In fact, analysis of the output videos of the independent models showed that, in this case, the tracker failed more often due to FP and FN detections. On the other hand, the performance deteriorated for images with the sky as background or with moving objects, especially when the UAV was flying at long range. In both cases, the algorithm reduced the system frame rate.
Given the reasons mentioned, the algorithm was not considered beneficial for the present study, and further tests do not include its application. It is important to emphasize that implementations of the ECC algorithm have shown acceptable results and improvements in tracker performance, namely in the project developed in [26]. Although it was discarded for the present work, it is still regarded as a valuable tool for image alignment, and its implementation is worthwhile in different contexts. Therefore, the ECC should be re-tested for an online implementation of the system onboard an aircraft, which is more susceptible to sudden camera movements that cannot be filtered out.

4.3.2. Target Operational Conditions

Finally, the target cases mentioned were analysed separately.
The results showed that both sensors managed to accurately detect the UAV when it was blurry or partially cut, and so the independent models showed high system robustness in these situations. One example of each is depicted in Figure 12.
As for intra-class variation, some conclusions were drawn from the analysis of the output video segments featuring the aircraft that was not included in the training data, for the independent models. In general, precision remains similar, which means the number of FPs was not highly impacted, as expected. For recall, however, the detector fails more often when predicting a UAV that was not included in the labelled dataset, although in most cases it still makes a successful detection, with a lower confidence score.
One of the most frequent problems that object detectors face is the existence of similar objects in the images that do not belong to the class being detected. In the case of UAV detection, birds are the main concern. The analysis of this case on the independent models showed that there was a decrease in average precision for both models that did not depend on UAV range, even though the value for recall remained similar to the average.
Finally, there was a decrease in average recall for the textured background case, and the performance of the detector was much better when the UAV was above the local horizon, having the sky as the immediate background. Even so, the case where the IR model made a detection but the EO model failed was more common because the IR signature of the UAV was highlighted against the background. Conversely, the EO images had more detail at a closer range and so the detection was more likely.
Thus, the particularly relevant cases to analyse with the implementation of the data fusion methodologies are the intra-class variation, presence of birds, and textured background scenarios, due to the lower robustness the independent models presented. One example for each of these scenarios is depicted in Figure 13. Both in the intra-class and textured background target cases, the figure shows an example with a successful detection in the EO sensor and a FN in the IR sensor, which the data fusion methodologies aim to eliminate. For the presence of birds case, both sensors have a FP detection on a bird besides the UAV.

4.4. Data Fusion Architecture Testing

The implementation of the decision- and pixel-level methodologies focuses on the improvement in precision and recall when compared to the independent models. It is important to note that the video segments from Flight Experiment A were not suitable to test the pixel-level data fusion architecture due to often degraded video alignment.
Table 5 shows the average results of the detector and tracker for Flight Experiments A and C. Here, for the decision-level architecture, the designation EO-IR refers to the case that has the EO model as the head model and the IR model is used for confirmation, and vice versa. In the pixel-level data fusion case, only tested for Flight Experiment C, the comparison is made with both the EO and IR independent models separately, in the format EO|IR.
For the decision-level architecture, precision and recall increase on average by 3.9% and 3.6%, respectively, for the EO-IR configuration, that is, when the EO results are occasionally complemented by the IR results, when compared to the EO independent model. As for the IR-EO configuration, there is an average increase in precision of 5.3% and in recall of 3.2% when compared to the IR independent model. The lower average precision and recall for Flight Experiment C are due to the fact that the independent models also score lower on these data; this does not directly mean that the algorithm is underperforming. In fact, in terms of percentages, the improvements that the algorithm manages to accomplish are similar between flight experiments. In some cases, even though there is a significant increase in precision, it comes at the cost of a reduction in recall. This compromise may or may not be worthwhile depending on the system requirements. One limitation of this architecture is that it is always conditioned by the performance of the sensors independently. This influences mainly recall because if, for instance, both cameras happen to produce a FN, the system with decision-level data fusion will preserve the FN and keep recall unchanged.
For the pixel-level approach, the results for precision and recall underperform when compared to both decision-level architectures for all ranges and, in most cases, do not show improvements when compared to the use of each sensor independently. The main factors influencing the results are the lighting conditions that were experienced during Flight Experiment C, that led the fusion to produce images that appear distinct from the ones featured in the dataset, and the imperfect video alignment. In fact, it is possible to conclude from the analysis of the output videos that the performance of the model is worse in the frames of the video segments when the alignment starts to fail, mostly in terms of recall, such as the one illustrated in Figure 14a. Evidently, the longer the range of the UAV, the more significant the impact of a failure in alignment is. As can be seen in the example in Figure 14b, in some of the video segments, a failure in alignment can cause a complete miss in overlap of the UAV from the two sensors. Here, the models frequently produced FPs by detecting two UAVs, as opposed to FN detections.
As for the number of IDSs per 100 frames, for the decision-level architectures, on average, this value is reduced in all cases presented, namely due to the reduction in FPs. For the pixel-level architecture, no pattern is verified for the number of IDSs per 100 frames, since precision and recall also vary in both directions.
For the frame rate, since the decision-level algorithm only resorts to the confirmation model when necessary, the average frame rate only drops by 6.5 to 34.3 fps for the close, medium, and far ranges, despite presenting a high variance. For the pixel-level fused models, when compared to the independent models, there is a lower processing time; however, this does not take the fusion time into account.

Target Operational Conditions

In terms of target conditions, the results of the data fusion architectures are studied for the isolated conditions in which the independent models showed low robustness.
First, the intra-class variation case contributes to decreasing the average recall of the decision-level architecture, that is, this case has lower recall than the average. Even so, compared with the independent models, these architectures show a more significant improvement. In fact, the average recall has an increase of 4.1% for the EO-IR configuration and of 4.2% for the IR-EO configuration. With the exception of the very far range limit of the system, the decision-level architecture shows robustness to intra-class variation. For the pixel-level data fusion, in the intra-class variation scenario, the results are comparable to the average, and the system still underperforms for recall. Figure 15 shows the corresponding images to Figure 13a,b, with the application of the data fusion methodologies.
One of the independent models fails to detect the UAV in the intra-class variation scenario, but the decision-level data fusion algorithm manages to make the detection in both configurations, and recall is improved. In this example, the pixel-level data fusion misses the detection. This specific case was found to be common in the tests.
In the presence of birds, the decision-level data fusion algorithm successfully manages to improve precision. On average, this metric was improved by about 8.2% and 11.3% for the EO-IR and IR-EO configurations, respectively. Since this scenario is only significant for Flight Experiment A, the pixel-level architecture could not be tested for this case. Figure 16 shows the application of the decision-level data fusion architectures on the same example as in Figure 13c,d, where both the independent models identify a bird as a UAV. As can be seen in the images, these FPs are eliminated by the decision-level data fusion algorithm, both having the EO or IR sensor as the head sensor.
Finally, for the textured background scenario, there are predominantly two factors affecting the results. First, when the range increases and the UAV size is limited to fewer pixels, a textured background makes the UAV more easily confused with it. This was also experienced by the naked eye during the flight experiments. Secondly, the lighting condition of the scenario causes a decrease in recall, especially for the twilight videos from Flight Experiment C. Despite these factors, the decision-level architecture manages to decrease the number of FNs and hence improve recall, and to increase precision, often significantly. Recall is improved, on average, by about 5.2% and 6.9% for the EO-IR and IR-EO configurations, when compared to the EO and IR independent models, respectively. However, the performance of the detector using the IR model is not as superior as expected. In fact, one of the main reasons to use the IR sensor was its ability to highlight the UAV in scenarios where it is easily mistaken for the background through an EO camera and imperceptible to the naked eye. Furthermore, during the flight experiments, when the UAV was at long range and flying below the local horizon, the operator of the sensors could only visually detect it through the IR sensor. Given this, the lower recall can be due mainly to two factors. Firstly, the IR images are more granular and not as sharp as the EO ones, which means that, at long range, regardless of the background, the detection task becomes more challenging. Secondly, the results can also be conditioned by the quality of the dataset itself: even though it supports generalization, it did not include as many textured-background scenarios as desired, so recall in this scenario suffered a decrease. For the pixel-level architecture, no consistent improvement is observed, even though some cases show significant increases in precision and recall. Figure 17 shows the same frames as Figure 13e,f, but with the implementation of the data fusion methodologies. In the decision-level data fusion cases, the UAV is detected, even though one of the independent models fails to detect it. However, the pixel-level architecture often fails to detect the aircraft against a textured background, which is in accordance with the low values for recall shown in Table 5.

5. Conclusions

In this project, a detection and tracking system for small UAVs using an EO sensor and an IR sensor was developed, and the use of these sensors independently was compared with two data fusion methodologies. To this end, flight experiments were conducted for data collection. As a result, additional contributions of this project are datasets of spatially and temporally aligned EO and IR data: one with labelled UAV images and one with unlabelled UAV videos. Finally, the system was evaluated for different operational scenarios and target conditions, and tested using the experimentally collected flight test data.
First, YOLOv7 tests were performed for five variations of the labelled dataset: EO images, IR images, IR images with image bias removed, pixel-level fused images, and pixel-level fused images with image bias removed. Similar performance metrics were obtained across variations, with an average precision of 0.884, average recall of 0.854, average mAP@0.5 of 0.885, and average mAP@[.5:.95] of 0.627. The dataset variations with the bias removed were discarded, since no significant improvements were observed.
Next, the detection and tracking tests were conducted on the inference dataset, with the ByteTrack tracker added to the system, using the independent EO and IR models to benchmark the performance of the sensors. Average results were presented for different ranges and target conditions. Both sensors exhibited acceptable performance in the blurry UAV and partially cut UAV scenarios, but precision decreased in the presence of birds, and recall decreased in the intra-class variation and textured background scenarios. Both decision-level and pixel-level data fusion methodologies were then tested on the same video segments and target conditions. In the presence of birds, the decision-level architecture showed significant improvements in precision, and thus in tracker performance, at the cost of a lower frame rate. For intra-class variation, the decision-level architecture improved precision, recall, and tracker performance, whereas the pixel-level architecture generally underperformed compared with both the independent models and the decision-level architecture. For the textured background case, both precision and recall improved significantly with the decision-level architecture, whereas the pixel-level architecture was not considered beneficial.
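The detection-plus-tracking pipeline summarized above can be pictured as a simple per-frame loop. The sketch below uses placeholder callables rather than the actual YOLOv7 and ByteTrack APIs: detect_uavs stands for any detector returning boxes with confidences, and tracker for any tracker exposing an update method that associates those boxes across frames.

import cv2

def run_sequence(video_path, detect_uavs, tracker):
    """Per-frame detect-then-track loop (illustrative).
    detect_uavs(frame) -> list of (x1, y1, x2, y2, confidence)
    tracker.update(detections) -> list of track objects with .track_id and .box"""
    cap = cv2.VideoCapture(video_path)
    tracks_per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections = detect_uavs(frame)       # detection step (e.g., a YOLOv7 model)
        tracks = tracker.update(detections)   # association step (e.g., ByteTrack)
        tracks_per_frame.append([(t.track_id, t.box) for t in tracks])
    cap.release()
    return tracks_per_frame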
To sum up, the decision-level data fusion architecture generally showed the best performance, and its use proved promising; there is also potential to optimize and enhance its implementation. Nevertheless, there is a trade-off between the gains in precision, recall, and tracker performance and the resulting decrease in frame rate. For this reason, the choice of architecture depends on the goals and requirements of each C-UAS system and must be considered carefully; using each sensor independently may also be beneficial in some scenarios. As for the pixel-level architecture, although it showed poor results in this study and is in general not considered advantageous, better equipment for more accurate image alignment and the use of other fusion algorithms may improve this methodology.
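As a concrete example of a simpler alternative fusion algorithm, the sketch below blends an aligned EO/IR pair with a weighted average; it only illustrates the kind of substitution suggested above and makes no claim about matching the GAN-based fusion used in this study.

import cv2

def blend_fusion(eo_bgr, ir_gray, ir_weight=0.5):
    """Weighted-average pixel-level fusion of a spatially aligned EO/IR pair.
    eo_bgr: EO frame (BGR, uint8); ir_gray: IR frame (single channel, uint8)."""
    eo_gray = cv2.cvtColor(eo_bgr, cv2.COLOR_BGR2GRAY)
    ir_resized = cv2.resize(ir_gray, (eo_gray.shape[1], eo_gray.shape[0]))
    return cv2.addWeighted(eo_gray, 1.0 - ir_weight, ir_resized, ir_weight, 0)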
The conclusions drawn from this proof-of-concept research, by comparing EO and IR sensor architectures and data fusion methodologies, are a contribution to detection and tracking tasks and a basis for future work on C-UASs.

Author Contributions

Conceptualization, A.P., S.W., A.M. and A.S.; methodology, A.P.; investigation, A.P.; resources, A.P. and S.W.; writing—original draft preparation, A.P.; writing—review and editing, A.P., S.W., A.M. and A.S.; supervision, A.M. and A.S.; project administration, A.S.; funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially funded by Fundação para a Ciência e a Tecnologia (FCT) under project LAETA Base Funding (https://doi.org/10.54499/UIDB/50022/2020). A.S. is grateful for the NSERC Discovery and Canada Research Chair Programs.

Data Availability Statement

The original data created in the study are openly available in Mendeley Data at https://doi.org/10.17632/sn9vy5c8sm.1.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
C-UAS          Counter-UAS
CNN            Convolutional Neural Network
DL             Deep Learning
EO             Electro-optical
FE             Flight Experiment
FN             False Negative
FP             False Positive
GAN            Generative Adversarial Network
IDS            Identification Switch
IR             Infrared
mAP@0.5        Mean Average Precision at 0.5
mAP@[.5:.95]   Mean Average Precision at [.5:.95]
MLP            Multilayer Perceptron
RF             Radio Frequency
UAS            Unmanned Aerial System
UAV            Unmanned Aerial Vehicle
UVIC-CfAR      University of Victoria’s Center for Aerospace Research

References

  1. Worldwide Drone Incidents. Available online: https://www.dedrone.com/resources/incidents-new/all (accessed on 19 January 2024).
  2. Castrillo, V.U.; Manco, A.; Pascarella, D.; Gigante, G. A Review of Counter-UAS Technologies for Cooperative Defensive Teams of Drones. Drones 2022, 6, 65. [Google Scholar] [CrossRef]
  3. Park, S.; Kim, H.T.; Lee, S.; Joo, H.; Kim, H. Survey on Anti-Drone Systems: Components, Designs, and Challenges. IEEE Access 2021, 9, 42635–42659. [Google Scholar] [CrossRef]
  4. Wang, J.; Liu, Y.; Song, H. Counter-Unmanned Aircraft System(s) (C-UAS): State of the Art, Challenges, and Future Trends. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 4–29. [Google Scholar] [CrossRef]
  5. Wang, B.; Li, Q.; Mao, Q.; Wang, J.; Chen, C.L.P.; Shangguan, A.; Zhang, H. A Survey on Vision-Based Anti Unmanned Aerial Vehicles Methods. Drones 2024, 8, 518. [Google Scholar] [CrossRef]
  6. Sun, Y.; Abeywickrama, S.; Jayasinghe, L.; Yuen, C.; Chen, J.; Zhang, M. Micro-Doppler Signature-Based Detection, Classification, and Localization of Small UAV with Long Short-Term Memory Neural Network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6285–6300. [Google Scholar] [CrossRef]
  7. Passafiume, M.; Rojhani, N.; Collodi, G.; Cidronali, A. Modeling small UAV micro-doppler signature using millimeter-wave FMCW radar. Electronics 2021, 10, 747. [Google Scholar] [CrossRef]
  8. Yan, J.; Hu, H.; Gong, J.; Kong, D.; Li, D. Exploring Radar Micro-Doppler Signatures for Recognition of Drone Types. Drones 2023, 7, 280. [Google Scholar] [CrossRef]
  9. Dogru, S.; Marques, L. Drone Detection Using Sparse Lidar Measurements. IEEE Robot. Autom. Lett. 2022, 7, 3062–3069. [Google Scholar] [CrossRef]
  10. Nelega, R.; Belean, B.; Valeriu, R.; Turcu, F.; Puschita, E. Radio Frequency-Based Drone Detection and Classification using Deep Learning Algorithms. In Proceedings of the 2023 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 21–23 September 2023. [Google Scholar]
  11. Fu, Y.; He, Z. Radio Frequency Signal-Based Drone Classification with Frequency Domain Gramian Angular Field and Convolutional Neural Network. Drones 2024, 8, 511. [Google Scholar] [CrossRef]
  12. Shi, Z.; Chang, X.; Yang, C.; Wu, Z.; Wu, J. An Acoustic-Based Surveillance System for Amateur Drones Detection and Localization. IEEE Trans. Veh. Technol. 2020, 69, 2731–2739. [Google Scholar] [CrossRef]
  13. Ahmed, C.A.; Batool, F.; Haider, W.; Asad, M.; Raza Hamdani, S.H. Acoustic Based Drone Detection Via Machine Learning. In Proceedings of the 2022 International Conference on IT and Industrial Technologies (ICIT), Shanghai, China, 28–31 March 2022. [Google Scholar]
  14. Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-Based Anti-UAV Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334. [Google Scholar] [CrossRef]
  15. Ghosh, S.; Patrikar, J.; Moon, B.; Hamidi, M.M.; Scherer, S. AirTrack: Onboard Deep Learning Framework for Long-Range Aircraft Detection and Tracking. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  16. Coluccia, A.; Fascista, A.; Schumann, A.; Sommer, L.; Dimou, A.; Zarpalas, D.; Méndez, M.; de la Iglesia, D.; González, I.; Mercier, J.P.; et al. Drone vs. Bird detection: Deep learning algorithms and results from a grand challenge. Sensors 2021, 21, 2824. [Google Scholar] [CrossRef] [PubMed]
  17. Ding, L.; Xu, X.; Cao, Y.; Zhai, G.; Yang, F.; Qian, L. Detection and tracking of infrared small target by jointly using SSD and pipeline filter. Digit. Signal Process. Rev. J. 2021, 110, 102949. [Google Scholar] [CrossRef]
  18. Fang, H.; Ding, L.; Wang, L.; Chang, Y.; Yan, L.; Han, J. Infrared Small UAV Target Detection Based on Depthwise Separable Residual Dense Network and Multiscale Feature Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 1–20. [Google Scholar] [CrossRef]
  19. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep Learning for Unmanned Aerial Vehicle-Based Object Detection and Tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124. [Google Scholar] [CrossRef]
  20. Svanström, F.; Alonso-Fernandez, F.; Englund, C. Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities. Drones 2022, 6, 317. [Google Scholar] [CrossRef]
  21. Alldieck, T.; Bahnsen, C.H.; Moeslund, T.B. Context-aware fusion of RGB and thermal imagery for traffic monitoring. Sensors 2016, 16, 1947. [Google Scholar] [CrossRef] [PubMed]
  22. Yang, L.; Ma, R.; Zakhor, A. Drone Object Detection Using RGB/IR Fusion. In Proceedings of the Symposium on Electronic Imaging: Computational Imaging XX, Online, 17–20 January 2022. [Google Scholar]
  23. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  24. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  25. Evangelidis, G.D.; Psarakis, E.Z. Parametric image alignment using enhanced correlation coefficient maximization. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1858–1865. [Google Scholar] [CrossRef] [PubMed]
  26. Lopes, J.P.D.; Suleman, A.; Figueiredo, M.A.T. Detection and Tracking of Non-Cooperative UAVs: A Deep Learning Moving-Object Tracking Approach. M.Sc. Thesis, Instituto Superior Técnico, Lisbon, Portugal, 2022. [Google Scholar]
  27. Sun, C.; Zhang, C.; Xiong, N. Infrared and visible image fusion techniques based on deep learning: A review. Electronics 2020, 9, 2162. [Google Scholar] [CrossRef]
  28. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  29. Szeliski, R. Computer Vision: Algorithms and Applications, 2nd ed.; Springer: New York, NY, USA, 2021; pp. 33–96. [Google Scholar]
  30. Pedro, S.; Tomás, D.; Vale, J.L.; Suleman, A. Design and performance quantification of VTOL systems for a canard aircraft. Aeronaut. J. 2021, 125, 1768–1791. [Google Scholar] [CrossRef]
  31. Castellani, N.; Pedrosa, F.; Matlock, J.; Mazur, A.; Lowczycki, K.; Widera, P.; Zawadzki, K.; Lipka, K.; Suleman, A. Development of a Series Hybrid Multirotor. In Proceedings of the 13th EASN International Conference on Innovation in Aviation & Space for opening New Horizons, Salerno, Italy, 5–8 September 2023. [Google Scholar]
  32. Zheng, Y.; Lin, S.; Kambhamettu, C.; Yu, J.; Kang, S.B. Single-Image Vignetting Correction. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 2243–2256. [Google Scholar] [CrossRef] [PubMed]
  33. Zheng, Y.; Grossman, M.; Awate, S.; Gee, J. Automatic Correction of Intensity Nonuniformity From Sparseness of Gradient Distribution in Medical Images. In Proceedings of the 12th International Conference on Medical Image Computing and Computer Assisted Intervention, London, UK, 20–24 September 2009. [Google Scholar]
Figure 1. Decision-level data fusion stages.
Figure 2. Pixel-level data fusion stages.
Figure 3. FusionGAN algorithm application: (a) Input Electro-optical (EO) image. (b) Input Infrared (IR) image. (c) Pixel Fused output image.
Figure 4. Experimental setup with highlight on the sensors.
Figure 5. Calibration procedure using the calibration board created: (a) EO image at close range. (b) EO image at far range. (c) IR image at close range. (d) IR image at far range.
Figure 6. UAVs captured during flight experiments: (a) FE A—Mini-E. (b) FE A—DJI Mavic 2. (c) FE B—MIMIQ. (d) FE C—DJI Inspire 1. (e) FE D—Zeta FX-61 Phantom Wing. (f) FE D—DJI Mini 3 Pro.
Figure 7. Schematics of flight paths: waypoints (blue) and workstation (red): (a) Flight Experiment A (Mini-E). (b) Flight Experiment A (DJI Mavic 2). (c) Flight Experiment C.
Figure 8. Bias-removal algorithm application: (a) IR original image. (b) IR bias-corrected image. (c) Estimated bias.
Figure 9. Artificial image pair creation algorithm: (a–c) EO images. (d–f) IR images.
Figure 10. Dataset examples: (a–c) EO images. (d–f) IR images. (g–i) IR images with bias removed. (j–l) Pixel Fused images. (m–o) Pixel Fused images with bias removed.
Figure 11. UAV recorded at twilight: (a) EO image. (b) IR image.
Figure 12. Independent model detection and tracking on higher robustness target cases: (a) EO blurry UAV image. (b) IR blurry UAV image. (c) EO partially cut UAV image. (d) IR partially cut UAV image.
Figure 13. Independent model detection and tracking on lower robustness target cases: (a) EO intra-class variation image. (b) IR intra-class variation image. (c) EO presence of birds image. (d) IR presence of birds image. (e) EO textured background image. (f) IR textured background image.
Figure 14. Alignment failure on Pixel Fused images: (a) Vertical shift of input images to FusionGAN. (b) Significant vertical shift of input images leading to complete UAV overlap miss on Pixel Fused images.
Figure 15. Data fusion detection and tracking on the intra-class variation target case: (a) EO-IR architecture. (b) IR-EO architecture. (c) Pixel-level fused architecture.
Figure 16. Data fusion detection and tracking with the presence of birds target case: (a) EO-IR architecture. (b) IR-EO architecture.
Figure 17. Data fusion detection and tracking on the textured background target case: (a) EO-IR architecture. (b) IR-EO architecture. (c) Pixel-level fused architecture.
Table 1. Sensor calibration results.

Sensor   Calibration Matrix                                                    Distortion Coefficients
EO       [2.5609 × 10^3, 0, 313.4439; 0, 2.9276 × 10^3, 360.1981; 0, 0, 1]     {2.1355, −5.5289, 1.5735 × 10^4, 0.1064}
IR       [2.4726 × 10^3, 0, 283.0549; 0, 2.7449 × 10^3, 194.0649; 0, 0, 1]     {−0.3653, 23.3465, −0.0247, −0.020}
Table 2. YOLOv7 test set results for 500 epochs.

Model            Precision   Recall   mAP@0.5   mAP@[.5:.95]
EO               0.860       0.827    0.839     0.599
IR               0.894       0.846    0.885     0.656
IR BR            0.887       0.860    0.886     0.646
Pixel Fused      0.893       0.873    0.900     0.622
Pixel Fused BR   0.886       0.866    0.896     0.614
Average          0.884       0.854    0.885     0.627
Table 3. YOLOv7-tiny test set results for 300 epochs.

Model         Precision   Recall   mAP@0.5   mAP@[.5:.95]   Time per Image Variation (%)
EO            0.856       0.813    0.823     0.544          −47.6
IR            0.877       0.835    0.873     0.615          −67.1
Pixel Fused   0.878       0.855    0.872     0.550          −55.7
Average       0.870       0.834    0.856     0.570          −56.8
Table 4. Independent model testing on the detector and tracker for Flight Experiments A and C.

FE   Data   Range    Nr. of Frames   Precision   Recall   Frame Rate (fps)   IDS per 100 Frames
A    EO     close    2736            0.956       0.983    97.6               3.408
            medium   5721            0.962       0.995    101.3              2.483
            far      4645            0.937       0.878    101.2              3.896
A    IR     close    2736            0.940       0.977    99.2               5.783
            medium   5721            0.960       0.963    103.0              4.398
            far      4645            0.905       0.955    100.4              11.779
C    EO     close    2226            0.950       0.638    105.1              5.809
            medium   8680            0.986       0.770    101.8              1.501
            far      9087            0.986       0.634    105.0              1.734
            v. far   2812            0.932       0.111    117.0              1.798
C    IR     close    2226            0.964       0.455    111.4              1.565
            medium   8680            0.943       0.773    103.9              4.354
            far      9087            0.919       0.717    104.8              9.071
            v. far   2812            0.928       0.245    115.8              3.872
Table 5. Data fusion testing on the detector and tracker for Flight Experiments A and C.

FE   Data          Range    Precision   Precision Variation (%)   Recall   Recall Variation (%)   Frame Rate (fps)   IDS per 100 Frames
A    EO-IR         close    0.999       +4.3                      0.979    −0.4                   91.1               0.000
                   medium   0.999       +3.7                      0.988    −0.7                   93.1               0.494
                   far      0.996       +5.9                      0.951    +7.3                   82.2               3.885
A    IR-EO         close    0.992       +5.2                      0.979    +0.3                   89.8               0.395
                   medium   0.997       +3.7                      0.989    +2.6                   92.4               0.681
                   far      0.992       +8.7                      0.952    −0.3                   84.1               3.007
C    EO-IR         close    0.999       +4.9                      0.634    −0.4                   81.5               0.679
                   medium   0.999       +1.3                      0.808    +3.8                   87.2               0.427
                   far      0.995       +0.9                      0.719    +8.4                   78.9               1.662
                   v. far   0.994       +6.1                      0.182    +7.1                   76.1               1.180
C    IR-EO         close    0.995       +3.1                      0.572    +11.7                  77.1               0.679
                   medium   0.992       +4.9                      0.822    +4.9                   87.0               0.532
                   far      0.989       +7.0                      0.752    +3.5                   81.7               1.431
                   v. far   0.975       +4.7                      0.241    −0.4                   75.8               0.600
C    Pixel Fused   close    0.943       −0.70 | −2.10             0.429    −21.0 | −2.70          113.4              4.253
                   medium   0.940       −4.50 | +0.10             0.577    −16.5 | −16.9          107.4              5.219
                   far      0.954       −3.30 | +1.70             0.413    −15.8 | −15.5          113.7              2.190
                   v. far   0.984       +5.20 | +5.60             0.123    +1.30 | −12.1          117.1              0.397
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
