1. Introduction
Effective management of threatened and invasive species requires regular and reliable population estimates, which in turn depend on accurate detection of individual animals [1,2,3,4,5,6,7]. Traditional methods of monitoring, such as conducting surveys along transects using ground-based experts, can be expensive and logistically challenging [3,8,9,10,11,12]. Surveying the large areas required for robust abundance estimates in a cost-effective way is also problematic [3,10]. In response to this, drones (also known as remotely piloted aircraft systems (RPAS) and unmanned aerial vehicles (UAV)) are rapidly being recognised as efficient and highly effective tools for wildlife monitoring [8,10,11,13,14,15,16]. Drones can cover large areas systematically, using pre-programmed flight paths [16], and carry sensors that capture data at a resolution high enough for accurate wildlife detection [10,13,14,17], even for relatively small mammals such as koalas [18]. In addition, drones cause less disturbance to wildlife than traditional ground-based surveys [13].
The large volume of data resulting from covering tens to hundreds of hectares is difficult to review manually, but machine learning is providing solutions that are faster and more accurate than manual review [10,16,19,20,21]. Deep learning architectures, in particular, are now commonly used for object recognition from images [8,10], and foremost among these are convolutional neural networks (CNNs), deep learning algorithms that learn progressively more complex image features in deeper network layers [8,10]. CNNs have been used by ecologists to detect a range of species from RGB imagery, including African savanna wildlife [22], polar bears [23] and birds [8,24]. CNNs have also been used to detect elephants from satellite imagery [25]. While large-bodied or otherwise easily detected species are relatively well studied, the accurate detection of animals against complex backgrounds has proved more challenging [10,16,18]. This is particularly true for drone surveys of small arboreal creatures, as the combination of low-altitude and wide-angle imagery can amplify problems with occlusion, and target animals tend to be dwarfed by background scenery [26,27,28].
A study published in 2019 notably achieved high accuracy for koala detection in complex arboreal habitats by fusing the output of two common CNNs (YOLO and Faster R-CNN) over multiple frames [18], which can be considered a primitive form of ensemble learning. Ensemble learning is well established in the computer vision community and involves the integration of multiple deep-learning algorithms into single, larger detector models, exploiting the inherent differences in the capabilities of different model architectures while minimising their weaknesses [29,30,31,32,33]. Ensemble learning improves model performance [29], strengthens predictive inference [31,34] and reduces model-based uncertainty compared to a single-model approach [32,35]. In addition to using two models, the approach of [18] aggregated detections across frames by aligning consecutive frames using key-point detection. Detections were then derived from a resultant ‘heat-map’ that captured areas recording repeated detections from the two models over a short time span. While this approach reduced false positives, it may also have excluded animals that were not continuously detected in densely vegetated areas. Additionally, the frame alignment process was potentially prone to error because of the low-contrast nature of thermal data.
Despite its potential, the application of ensemble learning in the ecological literature is sparse. Ensembles have recently been applied to image analysis tasks such as classifying cheetah behaviours [36] and multilevel image classes [30], and for the re-identification of tigers [37]. But apart from [33], who used ensemble learning to identify empty images in camera trap data, there has been little exploration of the enhanced predictive and computational power of ensembles for the detection of wildlife from remote sensing data. This represents a considerable opportunity for the ecological community. As ecology becomes more data-intensive, there is an increasing need for methodologies that can process large volumes of data efficiently and accurately [10]. Applying suitable object detection ensembles to low-altitude, drone-derived data has the potential to increase the accuracy, robustness and efficiency of wildlife detection.
In this study, we extend the method devised by [18], replacing the temporal fusion of two CNNs with ensembles of CNNs that run simultaneously on each frame. In doing so we move away from the temporal approach, which was devised because of limitations in the number of models that could be run simultaneously, and shift towards a stand-alone ensemble that allows for improved detection within a single frame. We systematically construct and analyse a suite of model combinations to derive ensembles with high potential to increase recall and precision from drone-derived data.
2. Materials and Methods
2.1. Data Preparation
A corpus of 9768 drone-derived thermal images was collated from existing datasets. To assist with false positive suppression, the dataset included 3842 “no koala” images containing heat signatures that could potentially be misidentified as koalas, such as tree limbs, clustered points in canopies more likely to be birds, and signatures observed to be fast-moving and located on the ground. The corpus was split into subsets comprising 8459 images (86.5%) for use as a training dataset, of which 3332 contained no koalas, and 1309 images (13.5%) for validation, of which 510 contained no koalas. Pre-processing of the data involved identifying koalas within the images and manually annotating bounding boxes with the LabelImg software. Most instances contained a single koala, which was expected given that koalas tend to be spatially dispersed [16].
The data were collected in drone surveys at Coomera (September 2016), Pimpama (October 2016), Petrie (February to July 2018) and Noosa (August to October 2021) in south-east Queensland, and Kangaroo Island (June 2020) in South Australia. Koalas occur across a vast area of Australia, and these survey locations represent a wide sample of environments in which they live. All flights were conducted at first light, using a FLIR Tau 2 640 thermal camera (FLIR, Wilsonville, Oregon, United States of America) with nadir sensor mounted beneath a Matrice 600 Pro (DJI, Shenzhen, China), with an A3 flight controller and gyro-stabilised gimbal. Camera settings included a 9 Hz frame rate (30 Hz at Kangaroo Island), 13 mm focal length and 640 × 512 pixel resolution. Drones were flown at 8 m s−1, following a standard lawnmower-pattern flight path with flight lines separated by 18 m. Altitude was set at 60 m above ground level to maintain a drone flight height of roughly 30 m above the top of the canopy. The sensor’s field of view at 25 m above ground level, just below maximum tree height, produced an image footprint of 39 m perpendicular to and 23 m along flight lines. At the time of survey, the GPS receivers in the Matrice 600 Pro had a horizontal accuracy of approximately 5 m. Receivers in the FLIR Tau 2 sensor had similar accuracy.
Data collected in a survey conducted at Petrie on 24 July 2018 were subsequently used for testing. The data comprised 27,330 images that were not included in the training and validation corpus, so that testing could be conducted on unseen data. On the morning of the survey, the area contained 18 radio-collared koalas, which provided valuable ground-truth data. The survey had previously been analysed by [18] with the CNN fusion approach, which yielded 17 automated detections.
2.2. Model Training
Detector models were based on state-of-the-art object-detection deep CNNs: tiny, medium, large and extra-large YOLOv5 v6.0 (https://github.com/ultralytics/yolov5, accessed on 15 December 2021), and Detectron2 implementations (https://github.com/facebookresearch/detectron2, accessed on 15 December 2021) of Faster R-CNN and RetinaNet. Existing models pre-trained on MS-COCO (a general-purpose object-detection dataset) were fine-tuned on the small corpus of training images using transfer learning, in which the previously learned weights were adjusted. Each CNN was trained multiple times, with each training run producing a unique set of weights due to the random order in which batches of data are fed to the model, resulting in subtle differences in performance, even for models of the same type and size.
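To make the transfer-learning step concrete, a minimal sketch using the Detectron2 API is shown below; the dataset registrations and output directory are hypothetical placeholders rather than the authors' configuration, with the iteration budget matching the 100,000 iterations described in the next paragraph.

```python
# Minimal sketch (hypothetical dataset names/paths, not the authors' exact
# pipeline): fine-tune a COCO-pretrained 50-layer Faster R-CNN on a single
# "koala" class using the Detectron2 API.
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
# Start from weights learned on MS-COCO, then adjust them on the koala corpus.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("koala_train",)   # hypothetical registered dataset
cfg.DATASETS.TEST = ("koala_val",)      # hypothetical registered dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1     # a single object class: koala
cfg.SOLVER.MAX_ITER = 100_000           # iteration budget used in this study
cfg.OUTPUT_DIR = "./frcnn_r50_run01"    # one of several independent runs

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)    # load the COCO checkpoint
trainer.train()
```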
Sixty models were trained in all: ten copies each of tiny, medium and large YOLOv5, 50-layer RetinaNet and 50-layer Faster R-CNN, and five each of extra-large YOLOv5 and 101-layer Faster R-CNN. Sixty individual models were considered sufficient to explore the effect of different combinations of model size, type and number without creating ensembles so large that processing would become unwieldy. YOLOv5 models were fine-tuned for 250 epochs each. Tiny and medium YOLOv5 models were trained with a batch size of 32, but this was reduced to 16 for large and extra-large YOLOv5 models due to constraints on available GPU memory. RetinaNet and Faster R-CNN models were fine-tuned over 100,000 iterations using the Detectron2 application programming interface (API). The koala detection methods employed a simple tracking approach, a simplification of the frame-to-frame camera transforms employed by [18] that is enabled by the improved detections within a single frame. In the current approach, tracked objects were associated with geographical coordinates, and the tracked camera positions were used to associate detections that occurred across sequential frames.
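The sketch below illustrates this kind of position-based association under stated assumptions; the Track structure, helper names and the 10 m association radius are illustrative choices informed by the platform's approximately 5 m GPS accuracy, not the authors' implementation.

```python
# Minimal sketch (illustrative assumptions, not the authors' implementation):
# associate detections across sequential frames by the drone's GPS position
# at the moment of detection, rather than by aligning the frames themselves.
import math

class Track:
    def __init__(self, track_id, lat, lon, confidence):
        self.track_id = track_id
        self.lat, self.lon = lat, lon       # last associated position
        self.confidence = confidence        # running best confidence

def distance_m(lat1, lon1, lat2, lon2):
    """Approximate ground distance in metres for small separations."""
    dy = (lat2 - lat1) * 111_320.0
    dx = (lon2 - lon1) * 111_320.0 * math.cos(math.radians(lat1))
    return math.hypot(dx, dy)

def associate(tracks, detections, next_id, radius_m=10.0):
    """Attach each detection (lat, lon, conf) in one frame to the nearest
    existing track within radius_m, or start a new track. radius_m is a
    hypothetical threshold chosen with the ~5 m GPS accuracy in mind."""
    for lat, lon, conf in detections:
        nearest = min(tracks,
                      key=lambda t: distance_m(t.lat, t.lon, lat, lon),
                      default=None)
        if nearest and distance_m(nearest.lat, nearest.lon, lat, lon) <= radius_m:
            nearest.lat, nearest.lon = lat, lon
            nearest.confidence = max(nearest.confidence, conf)
        else:
            tracks.append(Track(next_id, lat, lon, conf))
            next_id += 1
    return next_id
```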
2.3. Detector Evaluation
A combinatorial experiment was conducted which involved first assessing the performance of each individual copy of each model type and size on the validation dataset. A detector evaluator tool was devised for this purpose, which calculated the average precision (AP) achieved by each detector, based on the Object-Detection-Metrics devised by [38]. AP (also known as mean AP [mAP] when more than one class of object is detected) is the most common performance index for object detection model accuracy [8,28,38,39]. The tool enabled AP to be calculated in a consistent manner, regardless of model type, and allowed a direct quantitative comparison of how well each individual detector performed on the validation dataset.
AP values of individual detectors were then used to inform the composition of a range of ensembles that were in turn run across the validation dataset so that AP could be calculated for each ensemble. As it was impractical to test all possible ensemble combinations, the principle of saturated design was applied, so that analysis of additional combinations was discontinued when further improvements in AP appeared unlikely [40]. Consideration was also given to the overall size and complexity of ensembles, which influence inference time.
Model predictions (detections) were assessed using a threshold known as ‘intersection over union’ (IoU), which overlays the area of a prediction with the area of a corresponding ground truth (where there is one) and measures the area of their intersection as a proportion of the area of their union. A threshold of 0.8 was applied, whereby detections with IoU greater than or equal to 0.8 were classed as true positives (TP) and those with IoU below 0.8 as false positives (FP). Annotated koalas that were undetected were classed as false negatives (FN). Precision and recall values were then calculated as shown in Equations (1) and (2), with precision indicating the proportion of predictions that were correct and recall giving the proportion of all ground truths that were detected [38]:
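Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)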
The detector evaluator also calculated precision vs. recall (P-R) curves that visualised the inherent trade-off between these two metrics. The curves were smoothed by a process of interpolation, where the average maximum precision was calculated at equally spaced recall levels [38]. The AP value of each detector was finally determined by calculating the area under the P-R curve, with a large area indicating good model performance where precision and recall both remained relatively high [28].
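A minimal sketch of this evaluation logic is given below, assuming detections have already been matched against ground truths at the 0.8 IoU threshold; the 11-point interpolation follows the common Object-Detection-Metrics convention, and the authors' evaluator tool may differ in detail.

```python
# Minimal sketch (may differ from the authors' evaluator tool): compute IoU,
# then interpolated average precision (AP) at equally spaced recall levels.
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def average_precision(scores, is_tp, n_ground_truths):
    """Interpolated AP. `scores` are detection confidences, `is_tp` flags
    each detection as TP (IoU >= 0.8 against a ground truth) or FP, and
    `n_ground_truths` is the number of annotated koalas."""
    order = np.argsort(scores)[::-1]            # rank detections by confidence
    tp_flags = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(~tp_flags)
    recall = tp / n_ground_truths
    precision = tp / (tp + fp)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):         # equally spaced recall levels
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0                            # area under interpolated curve
```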
2.4. Ensemble Creation
A range of detector ensembles were created from the individually trained copies of each model type and size. In the ensembles, following non-maximum suppression, detections were aggregated across component detectors, with overlapping detections grouped. A confidence score [0, 1] was calculated for each detection by the individual models, with 0 indicating no confidence and 1 indicating 100% confidence. The detection threshold was set at 0.5, so that initial detections with a confidence score below 50% were discarded, effectively dampening spurious detections. Same-class detections were progressively merged based on the overlap of the bounding boxes produced by the different detectors, with the IoU threshold set at 0.8, meaning that detections from individual models within the ensemble with more than 80% overlap were progressively merged. The final detections output by the ensembles were based on the average of these merged detections, and a final confidence score was given for each: the sum of the confidence values of the grouped detections across the ensemble, divided by the total number of models in the ensemble.
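A minimal sketch of this merging scheme (reusing the iou helper from the previous sketch) is shown below; the greedy grouping order is an assumption, as the authors' implementation is not specified at this level of detail.

```python
# Minimal sketch (the authors' grouping order may differ): merge detections
# from all ensemble members by IoU overlap, average the grouped boxes, and
# divide summed confidence by the ensemble size.
def merge_ensemble(detections_per_model, iou_threshold=0.8, conf_threshold=0.5):
    """detections_per_model: list (one entry per model) of lists of
    (box, confidence) tuples, with box = (x1, y1, x2, y2)."""
    n_models = len(detections_per_model)
    # Discard low-confidence detections, then pool the rest across models.
    pool = [(box, conf) for dets in detections_per_model
            for box, conf in dets if conf >= conf_threshold]
    groups = []
    for box, conf in pool:
        for group in groups:                    # greedy grouping by overlap
            if iou(box, group["boxes"][0]) >= iou_threshold:
                group["boxes"].append(box)
                group["confs"].append(conf)
                break
        else:
            groups.append({"boxes": [box], "confs": [conf]})
    merged = []
    for g in groups:
        avg_box = tuple(sum(b[i] for b in g["boxes"]) / len(g["boxes"])
                        for i in range(4))      # average of merged detections
        # Final confidence: summed member confidences over ensemble size.
        merged.append((avg_box, sum(g["confs"]) / n_models))
    return merged
```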
Figure 1, Figure 2 and Figure 3 show examples of frames where objects have been identified by different detectors within ensembles and then tracked according to the confidence value assigned to the detection.
Compared to the earlier method devised by [18], this approach has a number of benefits: it avoids the need to align consecutive frames to register detection results, which improves robustness in the presence of rapid camera motion, and it offers a scalable solution in which the complexity of an ensemble can easily be adjusted by adding or removing detectors to meet diverse use cases.
The first group of ensembles comprised multiple copies of the same detector type and size. Individual copies were added one at a time, from highest to lowest individual AP, to allow the effect of each addition to be assessed (a sketch of this greedy procedure follows this paragraph). The next group of ensembles combined different numbers and sizes of YOLO models only or Detectron2 models only. The AP values achieved by these ensembles informed the composition of the third and final group of ensembles, in which both types and sizes were mixed.
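A minimal sketch of that greedy construction is shown below; evaluate_ap, a callable that runs a candidate ensemble over the validation set and returns its AP, is a hypothetical helper.

```python
# Minimal sketch (hypothetical helper names): grow a same-type ensemble by
# adding trained copies one at a time, from highest to lowest individual AP,
# recording ensemble AP after each addition.
def grow_ensemble(models, individual_aps, evaluate_ap):
    """models: trained copies of one detector type/size;
    individual_aps: validation AP of each copy;
    evaluate_ap: callable scoring a list of models run as an ensemble."""
    ranked = [m for _, m in sorted(zip(individual_aps, models),
                                   key=lambda pair: pair[0], reverse=True)]
    ensemble, history = [], []
    for model in ranked:
        ensemble.append(model)
        history.append((len(ensemble), evaluate_ap(ensemble)))
    return history  # AP as a function of ensemble size
```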
2.5. Ensemble Testing and Analysis
Four of the best performing ensembles were selected for testing on the unseen dataset. GPS coordinates associated with objects tracked by an ensemble indicated the position of the sensor on board the drone when the object was detected, rather than the position of the object itself. These locations were visualised in ArcGIS (v10.8) in order to identify instances where the same object was detected (duplicated) in multiple tracks. When visualised, duplicate tracks sometimes appeared as a compact linear sequence of detections along the line of flight, where continuous tracking had been interrupted by some occlusion. Duplicates could also occur in adjacent flight rows, where the same object was approached from opposite directions. We expect the number of duplicates to decrease as geolocation of target objects improves.
From analysis of the visualisation, the number of unique predictions was estimated for each ensemble. True positives were then confirmed by manual review against radio-collared koala locations, allowing precision (distinct from AP) and recall (probability of detection) values, and F1-scores, to be calculated for each ensemble on the test dataset. The F1-score is the harmonic mean of precision and recall, which allows overall performance to be evaluated in terms of the trade-off between the two metrics, as shown in Equation (3):
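F1 = 2 × (Precision × Recall) / (Precision + Recall)    (3)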
4. Discussion
While technologies such as drones and thermal imaging are enabling advances in environmental surveying, ecologists are struggling to process the large datasets generated with these tools. This is especially true for small and cryptic wildlife, which do not feature strongly in the literature; the majority of automated detection research to date has focussed on large-bodied mammals or birds that are easily discernible from relatively homogeneous backgrounds [20]. The study by [18] was the first to successfully apply automated detection to a small-bodied arboreal mammal from thermal imagery. While at the time the method was innovative for cryptic wildlife detection, deep learning has continued to evolve rapidly, and the algorithms used in that study are no longer cutting edge. While ensembles have been used in other domains, this is the first time that ensembles have been used in ecology for the detection of threatened species using drone-derived data. The deep learning ensembles provide much greater computational power, deriving valuable synergies from running suites of high-performance algorithms simultaneously. Our systematic study has devised a quantitative method for evaluating the combinations that achieve high precision and recall in small-bodied, arboreal wildlife detection and has demonstrated the utility of the approach. The results are strong in the broader computer vision context: one article published in 2021 [28] highlighted the lack of focus on small object detection and summarised mean AP values achieved on low-altitude aerial datasets as between 19% and 41%. Our results are particularly promising given the benefit that can be derived from further fine-tuning iterations, which can adapt the ensembles to specific contexts and require only minimal datasets for training. While we did not expect our study to match the results of [18], because our training data encompassed broader spatial and temporal settings, the ensembles nonetheless tested well with minimal training.
The shift to an ensemble approach offers a number of advantages that increase detector robustness. The ensembles are built from state-of-the-art models which perform better when detecting small targets, thus reducing the need to register frames for accumulating detections over time. It is also advantageous to use a larger number of simpler, faster models to estimate uncertainty with respect to detection. The approach can also be scaled for specific platforms without changing the underlying system, so that smaller and simpler ensembles can be employed when circumstances require, for example in an on-board setting, and larger ensembles can be used when more hardware is available.
The best performing ensemble contained the greatest diversity of component models, demonstrating the benefit of ensemble learning, which exploits the various architectural strengths and minimises the weaknesses [29,30,31,32,33]. In the validation phase, YOLO models consistently featured in high-performing ensembles, suggesting that their inclusion may be valuable when constructing ensembles for processing low-altitude imagery. The medium YOLO, in particular, appeared to be a valuable contributor to ensemble performance. While not subject to final testing, the relatively strong AP achieved by the 9× and 10× tiny YOLO ensembles in validation recommends them for innovations such as on-board processing, as they are very lightweight.
Perhaps surprisingly, the inclusion of RetinaNet models did not appear to be advantageous, and RetinaNet models were present in some of the lowest AP ensembles in the validation stage. The entire suite of ‘All Det2’ ensembles achieved lower AP than any of the ensembles containing YOLO components; however, the ten ‘All Det2’ ensembles that included RetinaNet models (with copies ranging in number between 1 and 10) were the lowest scoring of all. It is perhaps contradictory then that ‘Mix 10’, which performed best in testing, was the only tested ensemble with a RetinaNet component. ‘Mix 10’ however was a large and diverse ensemble, and it is possible that the computational power of the other components overcame any impediment the RetinaNet may have presented. It may be useful to test the same combination with the single RetinaNet excluded.
As well as performance accuracy, the evaluation of object detection models should encompass computational complexity and inference (processing) time [28]. In this study, ensembles which included YOLO detectors had lower processing times, which is to be expected given YOLO’s single-step architecture. The longest run times occurred with ensembles containing more than one copy of a Faster R-CNN model, whether 50- or 101-layer, which is again unsurprising given their region-based approach and more computationally complex backbone. Surprisingly, however, the inclusion of single-step RetinaNet components did not correspond with shorter run times. As drone-acquired datasets are generally large compared to typical photographic imagery, processing time is likely to be an important consideration in the context of small-bodied wildlife detection. The optimal ensemble will need to strike a judicious balance between inference time and accuracy for a given monitoring activity, but the inclusion of at least some YOLO components is strongly indicated.
The greatest impediment to higher precision in our ensemble approach was the number of false positive objects that were tracked. This was particularly the case for the ‘All Det2 12’ ensemble, which produced 100% recall but an impractically large number of tracked objects. To reduce false positives, future studies could apply a threshold to discard objects that are not tracked over some minimum number of frames. It is also possible, however, that uncollared koalas were present in the survey area, so that detections that appeared spurious may in fact have been correct.
In addition to the novel application of ensemble learning for automated detection of small-bodied wildlife from low-altitude drone surveys, an important feature of this study is the quantitative approach that has been devised to measure and compare model performance. Explicit rates of precision are rarely reported in ecological studies where drone surveys have been combined with automated wildlife detection [20]. The deep learning ensembles in this study achieved very high AP when tested on validation data. Our training data intentionally encompassed a broader range of environments and habitats than those of [18], which was trained specifically for detection at Petrie; this approach was designed to ensure greater robustness across more diverse settings. As a result, performance was expected to decrease on unseen testing data, where contextual semantic (background) information differed from that of the training data. However, precision can reasonably be expected to increase over time as continuous fine-tuning is undertaken based on errors identified in each new setting. Additionally, approximately 50% of the images in our validation dataset contained a koala, a far greater concentration than the proportion found in a complete survey such as our testing dataset.