1. Introduction
The study and conservation of wildlife have entered a new era with the integration of advanced technologies, providing researchers with unprecedented tools to monitor and understand animal behavior in their natural habitats [
1]. Among these technologies, unmanned aerial vehicles (UAVs) have emerged as versatile platforms for ecological research, offering the ability to capture high-resolution imagery and video from vantage points that were once inaccessible [
2].
In the past, wildlife avian surveillance of the Amvrakikos Gulf [
3] was carried out through physical monitoring or by utilizing telescopes. However, both methods are poorly suited to surveying wildlife. Physically approaching the islets to monitor birds caused stress, potentially leading them to break their eggs and causing long-term detrimental effects on their population. Using a telescope keeps the animals unharmed, but the distance from the islets to the nearest accessible land, together with the viewing angle and limited visibility, makes it difficult to survey the behavior of the wildlife.
In the realm of ornithological research, the utilization of UAVs holds immense potential for tracking and studying bird species in their natural environments [
1,
4]. When UAVs are coupled with automated computer vision methods, particularly those based on deep learning architectures, researchers can automate the identification and tracking of bird species in drone-captured footage, thereby overcoming the limitations of traditional manual tracking methods, the dangers of physical monitoring, and the shortcomings of telescope-based observation.
Figure 1 depicts an image retrieved above an islet at a safe altitude and containing wild birds.
During the surveillance of the islands in the Amvrakikos Gulf, we acted in accordance with precise protocols based on the nesting seasons [
5] of Dalmatian pelicans (
Pelecanus crispus) [
6]. Their nesting season typically spans from mid-January to mid-June. We ensured strict compliance with the safety protocols while surveying each island, so as not to disturb the wildlife during the breeding season. Ensuring the safety of the avian wildlife of the Amvrakikos Gulf remained our foremost concern, while our objective was to conduct a comprehensive survey of the wildlife population. This effort aimed to support the development of action plans for mitigating damage from avian influenza outbreaks such as those that have occurred in the past [
7], leveraging the collection of knowledge and various tools throughout the study.
During each flight over the islets, two key factors added complexity to our computer vision task: the drone moved continuously from one point of the island to another, and some of the target objects were themselves in motion. This posed challenges for detection; however, it also allowed us to track wildlife across the entire island in a single sweep. Computer vision with UAVs also raises new questions, such as the effect of camera motion (motion blur), which can degrade the performance of the detection model.
Figure 2 illustrates our flight plan methodology over wildlife nests, alongside details regarding the altitude safety protocols.
Additional challenges in using drones for wildlife surveillance included interaction with the birds. The lower the altitude and the louder the drone, the more likely the birds were to be disturbed, either flying off or moving away from the drone. Furthermore, if the birds feel threatened, they may collide with or attack the drone, creating a hazardous situation. In addition to these challenges, the 'Drone-vs-Bird Detection Grand Challenge' [
8] highlighted the complexity of distinguishing drones from birds in video sequences, especially when drones operate in bird-populated environments, further complicating wildlife surveillance efforts.
The higher the drone's altitude above the ground, the harder it became for the model to track wildlife precisely. Furthermore, detecting small- to medium-sized objects in high-resolution footage posed a challenge of its own, due to the many factors that needed to be considered.
However, traditional tracking methods have often been limited by their reliance on manual annotation and labor-intensive data processing. The advent of computer vision algorithms, particularly those based on deep learning architectures, has alleviated these challenges. These methodologies empower researchers to automate the identification and tracking of bird species in drone-captured footage, providing a more efficient and scalable solution.
Recent years have witnessed a surge in the application of state-of-the-art computer vision techniques for object detection, recognition, and tracking [
9]. Notably, frameworks such as you only look once (YOLO [
10]), single-shot multibox detector (SSD [
11]), and faster R-CNN (region-based convolutional neural network [
12]) have demonstrated remarkable success in real-time object detection, laying a solid foundation for their adaptation to ecological research.
The current state-of-the-art models in computer vision for avian tracking involve the integration of convolutional neural networks (CNNs) trained on extensive datasets of annotated bird imagery. These models excel at recognizing complex patterns and shapes, enabling the accurate identification and tracking of individual birds or groups within a given scene [
13]. By harnessing the power of deep learning, researchers can extract detailed information about bird movements, spatial distributions, and social interactions from vast amounts of aerial footage.
Previous studies that explored the application of state-of-the-art computer vision models on avian datasets such as VLIZ [
14] examined footage or images similar to those in our study. Another study that examined the same problem with a specialized dataset like ours focused on the high-altitude, top-down detection of wild birds [
9] using computer vision techniques up to YOLOv3 [
15]. However, many of these studies did not employ remote wildlife surveillance using drones from a bird’s-eye view perspective. Furthermore, datasets that focus on wild birds such as the Macaulay Library [
16] and Caltech [
17] are widely known datasets covering the vast majority of bird species. While both serve as valuable references for our use case, our approach was more specialized, requiring several additional parameters to be considered in order to effectively detect and track wildlife using drones. We focused our attention on modern detection models such as YOLOv7 [
18] and YOLOv8 [
19] and investigated their suitability for our objective.
Achieving high precision, as emphasized in similar studies [
20], is a challenging task. In this work, we leveraged recent advancements in the field and explored cutting-edge deep learning architectures and techniques specifically designed for high-precision inference in a custom dataset featuring small objects.
The dataset we built consisted of high-resolution drone footage of about 3840 × 2160 pixels (width × height). Feeding such large images directly into a model with a significantly smaller input size, typically around 640 × 640 pixels, results in a substantial loss of information, because the image is automatically rescaled to fit the model's input size; a 3840 × 2160 frame contains roughly 8.3 million pixels, whereas a 640 × 640 input holds only about 0.4 million, so around 95% of the pixel information is discarded. Our objective was to let the models use their full potential, without any loss of information during inference.
Figure 3 illustrates the information loss when scaling an image down to the model's network size. A significant loss of information and detail can be observed in the scaled image compared with the full-scale image.
We built upon the use of detection models for the task of identifying diverse scales of objects, using high-precision detection methods. Additionally, we explored the use of multi-object tracking (MOT) techniques. Our objective was the continuous tracking of multiple individual objects across multiple frames in a video. Object tracking endeavors to assign and maintain a unique identifier (ID) across multiple frames, as established in [
21].
The complex interplay between object detection and tracking is of great importance for successful multi-object tracking. While robust object detection is the foundation of this task, various factors beyond detection call for consideration [
22]. One such crucial aspect is the ability of the tracker to sustain an object’s ID, even when the object detection model fails to detect an object between consecutive frames; this technique is called re-identification. Re-identification aims to compensate for any data loss that may occur during inference due to missed detections by the object detector.
This manuscript delves into the intricacies of applying cutting-edge computer vision methods to avian wildlife monitoring, elucidating the nuances of YOLO and other relevant frameworks in the context of UAV-captured data. The subsequent sections will expound upon the specific adaptations and optimizations carried out to tailor these algorithms for the challenges posed by bird tracking, considering factors such as varying lighting conditions, diverse bird species, and complex natural environments.
As we navigate through this exploration of computational vision tools, our aim is to not only showcase their current capabilities but also to inspire further innovation in the realm of ecological monitoring. The synthesis of UAV technology and advanced computer vision methodologies not only augments the precision of avian tracking but also opens avenues for interdisciplinary research at the intersection of computer science and ecology, fostering a deeper understanding of avian behaviors and ecological dynamics.
The overarching goal of this research is to contribute to the growing body of knowledge on avian ecology by leveraging the capabilities of UAVs and sophisticated computer vision algorithms. As biodiversity faces increasing threats and challenges, understanding the dynamics of wildlife populations becomes crucial for effective conservation strategies. This manuscript elucidates the development and application of novel computational tools designed to track and analyze wild bird species, with a focus on their movements, group dynamics, and habitat preferences.
Through this interdisciplinary approach, merging insights from ecology, computer science, and remote sensing, our research contributes to the advancement of wildlife monitoring methodologies. The findings presented herein not only offer valuable contributions to the scientific community but also pave the way for enhanced conservation strategies that are grounded in a deeper understanding of avian behaviors and ecological interactions.
Our study presents several key contributions to the field of wildlife monitoring using advanced computer vision techniques:
We introduce a novel state-of-the-art computer vision model, ORACLE, designed to enhance the accuracy and efficiency of wildlife bird tracking from drone footage.
The ORACLE model demonstrated exceptional object detection capabilities, with a mean average precision (mAP) of 91.89% at 50% intersection over union (IoU), addressing the challenge of detecting small- to medium-sized wildlife from high altitudes.
Our methodology incorporates advanced multi-object tracking techniques that maintain consistent identification numbers across frames, which is crucial for long-term behavioral studies and population monitoring.
ORACLE facilitates detailed behavioral and population analytics, which are critical for conservation efforts, providing environmentalists and researchers with valuable insights into wildlife dynamics.
The application of our model extends to remote and inaccessible regions, demonstrating its robustness under challenging environmental conditions where traditional monitoring methods are not feasible.
Furthermore, the 2022 avian influenza outbreak had a devastating impact on Dalmatian pelicans [
7]. Our goal is to observe the wildlife and the environment of the Amvrakikos Gulf in Greece and to conserve it in the long run with the assistance of our model.
2. Materials and Methods
This section provides extensive information regarding the methodologies and implementations used to develop ORACLE, along with the high-accuracy computer vision models underpinning its robust object detection. Our primary emphasis lay on addressing the challenges of a specialized computer vision problem: high-resolution drone footage containing small- to medium-sized objects.
2.1. YOLO Models and Their Performance
A key objective of this research was not only to inform but also to conduct an extensive review of several detection models. We chose to focus specifically on YOLO (you only look once) models for several reasons. Firstly, YOLO models are renowned for their efficiency and speed in object detection tasks, making them well suited for real-time applications. Additionally, YOLO architectures have demonstrated strong performance across various datasets, including the challenging Microsoft COCO dataset [
23], which contains a wide range of object classes and scenarios.
We aimed to thoroughly assess the accuracy and performance of different YOLO models. To achieve this, we conducted extensive evaluations using various YOLO architectures and sizes on the Microsoft COCO dataset. This dataset is widely recognized and utilized for benchmarking object detection algorithms, due to its large-scale and diverse collection of images. By testing multiple YOLO models on the COCO dataset, we were able to gather comprehensive insights into their capabilities and limitations. This rigorous evaluation process allowed us to compile a comprehensive set of results, enabling us to make informed decisions about which models performed best under different conditions and tasks.
In the following, we explore several models of various sizes and architectures by evaluating them on the Microsoft COCO dataset, ultimately compiling this information into a comprehensive set of evaluation results.
The YOLO models utilize a size-based naming convention for their sub-models: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large). Furthermore, some frameworks offer additional architectures that can be trained depending on the specialization of the dataset; which architecture suits a model best varies with the problem at hand. Architectures such as P6 prioritize medium- to large-sized detections, whilst the P2 and P3 architectures prioritize small- to medium-sized detections. Ultralytics maintains an active repository with these architectures, from P2 up to P7, available and ready to use.
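As a brief illustration, the following sketch shows how such size and architecture variants could be loaded through the Ultralytics Python API; the configuration names are assumptions based on the public Ultralytics repository and may differ between versions.

# Hypothetical sketch: selecting a size/architecture variant in Ultralytics.
from ultralytics import YOLO

baseline = YOLO("yolov8x.pt")        # extra-large model with the standard P3-P5 detection head
small_obj = YOLO("yolov8x-p2.yaml")  # P2-head variant aimed at small objects, built from a config file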
Due to the great number of models already available for use, it is difficult to determine which model fits best for a specialized dataset such as ours. As such, this part of the study delved into a comparative analysis of the models (v5, v7, and v8), including their sub-models. As mentioned previously, we evaluated their overall performance (mAP) on the MS COCO dataset [
23] to clarify several key points in employing specific model sizes or architectures in comparison to the COCO dataset and ours.
The MS COCO dataset consists of annotations sized from small to extra-large objects.
Table 1 describes the annotation size distribution in COCO. The most dominant size range is the small one, accounting for 41.43% of all annotations, whilst large annotations make up 24.25% of the overall dataset.
In all our experiments, we made sure to evaluate each model under comparable conditions, with little to no imbalance. We achieved this by using the same confidence and IoU threshold [
25] values for the post-processing function non-maximum-suppression (NMS) [
26].
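As an illustration, the snippet below sketches how such a comparison could be run with identical post-processing settings; the weight filenames and threshold values are placeholders rather than the exact ones used in our evaluation, and the parameter names follow the Ultralytics validation API.

# Sketch: evaluating several models with the same confidence and IoU thresholds for NMS.
from ultralytics import YOLO

for weights in ["yolov5x6u.pt", "yolov8x.pt"]:          # placeholder model files
    metrics = YOLO(weights).val(data="coco.yaml", conf=0.001, iou=0.65)
    print(weights, metrics.box.map50)                    # mAP at 50% IoU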
All models listed in
Table 2 were evaluated using the pycocotools library from Microsoft COCO [
23]. It should be noted that many YOLO models are distributed across multiple frameworks, such as Darknet [
27], Ultralytics [
19], and PyTorch [
28]. This diversity might produce slight variations, typically within a range of 1–2%, in the evaluation results.
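For reference, a minimal pycocotools evaluation loop looks roughly like the following; the annotation and detection file paths are placeholders.

# Minimal COCO evaluation sketch using pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")              # ground-truth annotations
coco_dt = coco_gt.loadRes("model_detections.json")    # detections exported by a model
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # reports AP/AR overall and for small, medium, and large objects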
The evaluation results of YOLOv5, YOLOv7, and YOLOv8 in
Table 2 serve as an indication of the overall performance of each model on a dataset such as MS COCO [
23].
The trained YOLOv5 models with the P6 architecture produced satisfactory results, with YOLOv5x6 achieving the highest overall mAP of the compared models on the COCO dataset. The YOLOv5 P6 sub-models prioritize medium, large, or extra-large detections, which explains their high accuracy on COCO.
YOLOv7 and its sub-models demonstrated optimal performance when utilizing the P6 architecture. However, it is noteworthy that the larger models may yield only a modest improvement in overall mAP compared to some of the smaller ones. Despite that, larger models play a crucial role in achieving strong detection results in certain challenging tasks.
In contrast to YOLOv7, which contains a limited number of architectures for small-object detection, YOLOv8 boasts a diverse range of models optimized for this task. We directed our focus towards YOLOv8 due to its exceptional efficiency in dynamically detecting small- to medium-sized objects with minimal inconsistencies.
Moving forward, our study expanded on a methodology for small object detection. We examined the techniques employed to develop a tiling algorithm, as well as the challenges encountered along the way. Subsequently, we evaluated several models using a custom dataset that prioritizes high-resolution images with small-sized detections.
2.2. ORACLE: A Five-Layered Model for Advanced Object Detection & Tracking
Our efforts to implement a model capable of advanced object detection and tracking, as well as the extraction of valuable analytics, began with the development of a five-layered model. During this study, we implemented an advanced algorithm capable of processing footage recorded using high-grade industrial drones. The model's objective is to receive footage recorded in wildlife environments as input and to extract important analytics about the observed wildlife, in the form of statistics and data visualizations. The model targets data analysts and environmentalists who wish to process and understand the activity of ecosystems.
ORACLE is a five-layered model capable of applying advanced object detection and tracking techniques, along with the extraction of valuable information and analytics from drone footage. It is capable of extracting data, such as estimating counts of wildlife, exporting videos using various visualization methods, adding masks to distinguish an object from the background, and more. ORACLE uses state-of-the-art computer vision models prioritizing small- to medium-sized objects.
Figure 4 illustrates the sequential processes executed by our model to facilitate the extraction and exportation of significant analytics.
The pre-processing layer works by dynamically loading the models based on their individual framework. Additionally, each frame is enhanced with the use of the gamma correction technique [
29], to reduce sunlight degradation. The footage is loaded in the form of a dataset; each frame is processed individually and is later tiled during inference into a tensor that takes advantage of CUDA technology [
30], in order to achieve faster inference [
31].
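A minimal gamma-correction sketch is shown below; the gamma value is an illustrative choice for taming bright, sunlit frames and is not necessarily the value used in ORACLE.

# Gamma correction via a 256-entry lookup table (OpenCV).
import cv2
import numpy as np

def apply_gamma(frame: np.ndarray, gamma: float = 0.8) -> np.ndarray:
    inv = 1.0 / gamma
    table = ((np.arange(256) / 255.0) ** inv * 255.0).astype(np.uint8)
    return cv2.LUT(frame, table)   # maps every pixel intensity through the power-law curve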
Figure 5 depicts a frame loaded into the dataset. This image is later processed in the second layer, the object detection layer.
The object detection layer works by selectively applying inference based on the model version. This layer processes each frame individually, and this is repeated throughout the entire video. Each frame is segmented into multiple pieces and then passed onto the detection model [
32].
This layer was developed to dynamically utilize specific inference functions based on the model version and the platform it was published in. Models published and developed in Ultralytics utilize SAHI inference [
32]. Models published and developed in PyTorch take advantage of our tiling algorithm based on DarkHelp [
33].
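For the Ultralytics models, sliced inference with SAHI can be sketched as follows; the tile size, overlap ratios, and weights path are illustrative assumptions rather than ORACLE's exact settings.

# SAHI sliced inference sketch for an Ultralytics-trained detector.
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="best.pt",            # placeholder weights
    confidence_threshold=0.25,
    device="cuda:0",
)
result = get_sliced_prediction(
    "frame_0001.jpg",                # placeholder frame
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
boxes = [p.bbox.to_xyxy() for p in result.object_prediction_list]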
Figure 6 depicts the results using our tiling algorithm for most PyTorch models.
The post-detection processing layer merges the detections that were split during inference due to the image tiling algorithm, as depicted in
Figure 6b.
This layer is activated only when a PyTorch model is used, in order to merge any detections that were split during inference due to tiling.
Figure 7b depicts the result of the post-detection processing layer, which attempts to merge the two detections resulting from the previous layer,
Figure 6b.
The tracking layer is dedicated to the tracking model, which handles the outcomes derived from the previous layer. Its objective is to assign an identifier to each detected object and consistently retain this identifier across multiple frames.
Figure 8 showcases the visible results of the model, with each detection displaying information such as the class name, identification number, and prediction confidence.
In
Figure 8, a green mask overlays the object we tracked. The goal of overlaying the object is to target the pixels that depict the tracked object's surface. Mask overlaying allows us to process those pixels later for a more in-depth analysis (e.g., extracting the exact temperature of the green pixels). We achieved this by applying color thresholding to each detection individually, using a pixel value that distinguishes the object's surface from the background. As is evident in
Figure 8b, there is a noticeable contrast that makes the masking process far easier.
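The per-detection masking step can be sketched as below; the pixel-value bounds are assumptions chosen for bright plumage against a darker background, not the exact thresholds used in ORACLE.

# Color-threshold masking of a single detection (OpenCV).
import cv2
import numpy as np

def mask_detection(frame: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    x1, y1, x2, y2 = box
    crop = frame[y1:y2, x1:x2]
    mask = cv2.inRange(crop, (200, 200, 200), (255, 255, 255))  # keep near-white pixels
    overlay = crop.copy()
    overlay[mask > 0] = (0, 255, 0)                              # paint the object's surface green
    return cv2.addWeighted(crop, 0.5, overlay, 0.5, 0)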
The post-processing layer is the final layer, where ORACLE generates visualizations and extracts valuable information regarding the environment depicted in the footage. In this layer, ORACLE uses several algorithms capable of producing insightful information from the data; one example is the estimation of the number of wildlife detected within the footage, as sketched below.
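One simple way to realize such a count estimate, consistent with the description above, is to count the unique track IDs observed over the whole video; the helper below is a hypothetical sketch, not ORACLE's exact implementation.

# Population estimate from unique track IDs.
def estimate_population(tracks_per_frame: list[list[int]]) -> int:
    unique_ids: set[int] = set()
    for frame_ids in tracks_per_frame:
        unique_ids.update(frame_ids)
    return len(unique_ids)

# e.g., estimate_population([[1, 2, 3], [1, 2, 3, 4], [2, 3, 4]]) -> 4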
We can observe from the previous
Figure 6,
Figure 7 and
Figure 8 that, during visualization, we created a zoom effect on the oldest active tracks, determined by life-span and sorted by ID. This visualization method was primarily used for better observation of the objects.
Figure 9 displays three tracked objects in four different timelines of a video. With the application of advanced object-detection techniques alongside tracking methods, we can observe how those models were capable of consistently maintaining the same track IDs for the same objects over extended periods of time.
Using ORACLE, we produced a video demonstrating the model's development and results: ORACLE Video Result.
2.3. Image Tiling
Our approach involves segmenting each full-scale image into smaller tiles and feeding them individually into the model. This methodology is consistently applied during both the inference and training stages, to maintain the same dimensions. However, existing libraries such as PyTorch offer limited tiling support for specific models such as YOLOv7. To address this issue, we developed our own custom tiling algorithm inspired by the implementation in DarkHelp [
33].
During inference, the image is segmented into multiple tiles, and each tile is processed by the model to generate individual results. Subsequently, the tiled detections are stitched back together to reconstruct the complete image. This approach not only reduces memory consumption but also ensures that no information is lost during inference.
Given that our target network size is 640 × 640, a 4K image should ideally be tiled approximately 20 times. However, considering the dimensions of the image and the network size, a portion of the image would not be covered by the tiles. To address this issue, we padded the remaining part of the image, to compensate for any information that might otherwise be lost during inference.
Figure 10 visually demonstrates this process.
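The arithmetic behind this tiling can be sketched as follows; this is an illustrative implementation of the idea, not ORACLE's exact code, and assumes tiles are cut on a regular grid with right/bottom padding.

# Tiling a 3840x2160 frame into 640x640 tiles with edge padding.
import math
import numpy as np

def tile_image(frame: np.ndarray, tile: int = 640) -> list[np.ndarray]:
    h, w = frame.shape[:2]
    cols, rows = math.ceil(w / tile), math.ceil(h / tile)      # 6 x 4 tiles for 4K footage
    padded = np.zeros((rows * tile, cols * tile, frame.shape[2]), dtype=frame.dtype)
    padded[:h, :w] = frame                                     # pad the right and bottom edges
    return [padded[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            for r in range(rows) for c in range(cols)]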
Image tiling maximizes the model's potential for higher-resolution images; however, this technique brings its own challenges. The higher the resolution of the images, the more tiles are produced, leading to slower inference times.
A substantial issue we faced using image tiling was when we attempted to segment an image into multiple tiles and subsequently load each tile individually into the model for inference. This led to detections being split across tiles when the image was reconstructed. To compensate for this issue, we developed a detection merging technique based on the image tiling implementation in DarkHelp [
33]. This detection merging technique works by separating detections into candidates and non-candidates. Candidate detections, shown in
Figure 11, are the detections closest to the tile edges, as determined by a tile edge factor; the default tile edge factor was set to 0.3.
Whether a detection is near the edge of a tile is determined from the distances between the detection's edges and the edges of its tile in all four directions. If any of these distances is smaller than a minimum horizontal or vertical threshold, the detection is a candidate. Equation (1) depicts the method used to calculate the minimum threshold values and the distances: Det_width/height represents the width or height of the detection, the TileEdgeFactor was set to 0.3 by default, and the top/left/bottom/right distances are calculated from the positions of the tile and the detection.
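The candidate test can be sketched in code as below; the variable names are ours, and the thresholds are assumed to be the detection's width and height scaled by the tile edge factor, consistent with the description above and with the DarkHelp implementation the technique is based on.

# Candidate test: is a detection close enough to a tile border to be a merge candidate?
def is_candidate(det: tuple[float, float, float, float],
                 tile: tuple[float, float, float, float],
                 tile_edge_factor: float = 0.3) -> bool:
    x1, y1, x2, y2 = det          # detection corners in image coordinates
    tx1, ty1, tx2, ty2 = tile     # tile corners in image coordinates
    min_horizontal = (x2 - x1) * tile_edge_factor
    min_vertical = (y2 - y1) * tile_edge_factor
    left, right = x1 - tx1, tx2 - x2
    top, bottom = y1 - ty1, ty2 - y2
    return (left < min_horizontal or right < min_horizontal
            or top < min_vertical or bottom < min_vertical)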
After calculating the relative distances of the detections to the tile edges, they are compared against these thresholds to determine whether each detection is a candidate or a non-candidate. Once the candidate detections have been found, they are merged based on a merging factor, which is typically set to 1.35. Equation (2) defines lhs_plus_rhs as the sum of the areas of the lhs and rhs bounding boxes, multiplied by a merge factor of 1.35, whereas union refers to the area of the union rectangle of the two detections:
Finally, the logic behind whether two detections will be merged is determined using Equation (
3).
Figure 12 is a visual representation of the comparison across several detections. If two detections are in proximity to the edges of two tiles and their combined areas, multiplied by the merge factor, are greater than or equal to the area of their union, then the detections are merged; otherwise, the rhs detection is not merged with the lhs detection. The red squares outline the slicing of two tiles, the green rectangles represent the split detections resulting from image tiling, whilst the orange-dotted rectangle represents the union of the two detections in question.
Additionally, the pseudocode Algorithm 1 of the merging technique thoroughly describes the analytical steps taken to merge the candidate detections.
Algorithm 1 Detection Merging

function merge_detections(detections, tile_rect_factor)
    result ← []
    checked_indices ← {}
    for each lhs_index, lhs_pred in detections do
        if lhs_index is in checked_indices then
            continue
        end if
        append lhs_index to checked_indices
        merged ← False
        for each rhs_index, rhs_pred in detections do
            if rhs_index is in checked_indices or rhs_index equals lhs_index then
                continue
            end if
            calculate lhs_area and rhs_area
            create union_rect from lhs_rect and rhs_rect
            calculate union_rect_area
            lhs_plus_rhs ← (lhs_area + rhs_area) ∗ tile_rect_factor
            if union_rect_area ≤ lhs_plus_rhs then
                create new_pred based on union_rect, lhs_pred and rhs_pred
                append new_pred to result
                mark rhs_index as checked
                merged ← True
                break
            end if
        end for
        if not merged then
            append lhs_pred to result
        end if
    end for
    return result
end function
Upon completion, each combined detection, alongside the detections with no neighboring detections for merging, is appended to a detection results list. An illustration of this technique can be observed in
Figure 13.
2.4. Model Fine-Tuning
Model fine-tuning is a standard step in developing high-precision detection models, such as the one we aimed to achieve for small object detection. Wildlife surveillance via drones poses a great challenge, due to the variety of shapes and forms in which the birds may appear (e.g., spread wings, different angles, different sizes, etc.). Fine-tuning aims to address these issues by enhancing the quality and quantity of the dataset and of the training, thereby improving the accuracy of the detection models.
There are two common fine-tuning methods: one involves enhancing the dataset through image augmentation techniques and the other focuses on fine-tuning the model by adjusting the augmentation parameters used during training. In addition to these, transfer learning is another method that significantly improves model performance by adapting pre-trained models to new tasks (datasets).
Our first approach in fine-tuning the YOLO detection model was through the application of transfer learning [
34]. This method significantly enhances a model’s accuracy by refining its detection capabilities from one dataset to another. Transfer learning is generally more effective than training from scratch, leading to substantial improvements in a model’s overall performance.
Our second approach involved further enhancing the dataset through image augmentation techniques. Specifically, the application of zoom and crop significantly increased the dataset size by applying random zooms or crops to the images. The resulting images were resized to match the network’s input size, ensuring consistency across all images.
Our final approach was to adjust the augmentation hyper-parameters during training, with the goal of increasing accuracy. Training a detection model such as YOLOv8 [
19] allows users to manually edit these augmentation parameters. In the case of small object detection, adjusting certain parameters gives the model more or less variety during training, and can thereby increase the performance of various detection models. Parameters such as mosaic, scale, and fliplr [
35] generally have a positive effect by creating more variety during training; however, mixup negatively impacts the accuracy for small object detection tasks. Our primary focus was on the parameters that apply augmentation to the image.
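An illustrative Ultralytics fine-tuning call with such augmentation hyper-parameters is sketched below; the dataset config and the specific values are placeholders, not the exact settings listed in Table 3.

# Transfer learning with adjusted augmentation hyper-parameters (Ultralytics API).
from ultralytics import YOLO

model = YOLO("yolov8x-p2.yaml").load("yolov8x.pt")   # start from COCO-pretrained weights
model.train(
    data="amvradia.yaml",   # placeholder dataset config
    imgsz=640,
    epochs=100,
    mosaic=1.0,             # compose four images per training sample for more variety
    scale=0.5,              # random zoom
    fliplr=0.5,             # horizontal flips
    mixup=0.0,              # disabled, since it hurt small-object accuracy in our setting
)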
Table 3 depicts the various adjustments to the hyper-parameters made across the frameworks.
Figure 14 displays a training batch of 16 images used to train the detection models whilst utilizing various augmentation hyper-parameters such as mosaic.
2.5. Multi-Object Tracking Models
During inference, we used the highest-performing re-identification model, OSNet [
36,
37], within an advanced multi-object tracking (MOT) pipeline. OSNet was trained on the MSMT17 dataset [
38], a large-scale combination of various datasets containing small- to medium-sized detections, which makes it well suited to our re-identification use case. The tracker we used in all our experiments was DeepOC-Sort [
39], an advanced algorithm prioritizing MOT with re-identification.
Table 4 describes the evaluation results of the various OSNet model sizes on the MSMT17 dataset [
38].
Integrating a tracker into our inference algorithm greatly increased the depth and value of information we could extract, instead of relying solely on an object detector to extract information such as an estimated number of objects visible in a video.
Simply using an object detector fell short of pinpointing the same detection across frames. The only information an object detector provides is the coordinates (x1, y1, x2, y2), the confidence, and the class name, which is not enough to associate individual detections between frames. It is, however, sufficient input for a tracker such as DeepOC-Sort. Using a tracker allows us not only to identify a specific detection within a frame but also to follow the same detection across multiple frames; during the inference of DeepOC-Sort, we obtain the same information as from the object detector, but each detection now also carries an identification number.
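The hand-off from detector to tracker can be sketched as follows; the tracker interface assumed here follows open-source DeepOC-Sort implementations such as the boxmot package, whose constructor arguments and output column order vary between versions.

# Detector-to-tracker hand-off sketch.
import numpy as np
# from boxmot import DeepOcSort   # assumed import; see the tracker's own documentation

def track_frame(tracker, detections: np.ndarray, frame: np.ndarray) -> np.ndarray:
    # detections: one row per box as (x1, y1, x2, y2, confidence, class_id)
    tracks = tracker.update(detections, frame)
    # tracks: the same boxes extended with a persistent track ID per object
    return tracks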
2.6. Tools & Equipment
The specifications for this study included a high-end computer equipped with an RTX 4090, which allowed us to train detection models quickly and run ORACLE at high speeds.
A low-noise drone was used for multiple flight plans over several days, to avoid disturbing the wildlife. The drone was equipped with a rotating gimbal camera that records up to 4K video and can fly for up to 30 min in ideal conditions.
Additional tools for this research included DarkMark [
40], an advanced labeling tool used to develop, improve, and augment our dataset, which features automatic labeling, dataset statistics reviewing, and more.
5. Conclusions
Our objective was to automate the task of surveying remote and inaccessible environments, without relying on manual labor. We initiated the process by securely retrieving footage suitable for training or evaluating detection models. However, merely employing object detection techniques falls short in comprehensively “surveying” a remote area and extracting valuable analytics. Thus, during the course of this study, we devised a sophisticated model titled ORACLE (optimized rigorous advanced cutting-edge model for leveraging protection of ecosystems). ORACLE not only performs various AI tasks related to computer vision, but also automates the task of surveillance and facilitates the extraction of valuable analytics.
This study delved deeper into the advancement of wild bird surveillance with the assistance of drones. Our previous study [
13] focused on the development of a dataset for detecting small-sized objects using YOLOv4 [
42] and YOLOv4-tiny. During inference, we achieved a total of 91.28% mAP with our previous dataset. The present study advanced the task of object detection even further, by significantly improving both the quality and quantity of the dataset.
In our new dataset, named after our project, the AMVRADIA dataset, we achieved a peak accuracy of 95.96% (evaluated at 50% IoU), as depicted in
Table 10, using a large-scale model that prioritized small- to medium-sized detections, YOLOv8x-p2 [
19]. Moreover, we improved the quality of this study using tracking techniques, with the assistance of the DeepOC-Sort [
39] algorithm combined with the OSNet model [
36].
Since tracking is a process that goes hand-in-hand with object detection, we evaluated both detection and tracking models on the same scale, to show a robust correlation of our evaluation results, as depicted in
Table 11. This evaluation methodology demonstrated the impact of the detection models on the performance of our tracks. It also showed how the detection layer forms the basis of our surveillance model, ORACLE, and underpins its overall performance.
In addition to the detection and tracking layers, which facilitate robust data extraction, we implemented and presented various algorithms for information extraction and visualization. Among these, the primary algorithm was our object count estimation algorithm (referenced as Item 1). While the implementation of this algorithm may appear straightforward, its accuracy is heavily dependent on the performance of both the detection and tracking layers.
Finally, this study concluded with the use of high-level state-of-the-art detection and tracking models that facilitate the task of data extraction from drone footage. We optimized and fine-tuned both of these techniques to a state of near-perfect accuracy. We not only evaluated the models but also developed algorithms capable of extracting valuable information, such as wildlife count estimates.