2.1. Datasets
CR-CNN was trained and evaluated on datasets comprising both real-world and synthetic data. We created the synthetic dataset as a simple, standardized dataset with minimal noise and controlled variables in order to form a baseline for the proposed methodologies. Exploratory evaluation on the synthetic data determined whether the expected outcome of both strategies was met before their application to more complex and challenging real-world data. The synthetic data consist of multiple colored shapes on different colored backgrounds, with the inclusion of some challenging and obscure objects, such as overlapping shapes. The synthetic training dataset comprises 1500 images totaling 33,520 objects, 24% of which are smaller than $16 \times 16$ pixels and 69.58% of which are smaller than $32 \times 32$ pixels. The evaluation dataset comprises 150 images totaling 6786 objects, with 1651 shapes smaller than $32 \times 32$ pixels.
For real-world aerial data, we trained and evaluated our proposed architecture on two datasets, the Aerial-Cars dataset [54] and the Vehicle-Detection dataset [55], chosen for their challenging characteristics compared to other common UAV-based datasets [38,53,56]. Both include multiple camera angles, altitudes and complex backgrounds, which accurately capture the challenging nature of aerial-based data.
The Aerial-Cars dataset was further annotated with an additional 546 vehicles, primarily located closer to the horizon or partially obstructed, to provide a more realistic environment for stress-testing the detection architectures. The Aerial-Cars training dataset contains 3174 objects, 539 of which are smaller than $32 \times 32$ pixels, within 155 images. The evaluation dataset consists of 33 images, an example of which can be seen in Figure 1.
The Vehicle-Detection training dataset (sub-datasets 1, 2 and 3) comprises 1495 images containing 19,348 objects; only car annotations were used for training and evaluation. Of these 19,348 objects, 80% are smaller than $32 \times 32$ pixels. The evaluation dataset (subset 4) comprises 190 images containing a total of 3647 objects, 3325 of which fall below the COCO-defined $32 \times 32$ pixel threshold for small objects. The distribution of object sizes within the Vehicle-Detection dataset, with a comparison to the synthetic and Aerial-Cars data, can be seen in Figure 2.
Finally, the proposed inflated IoU training strategy was also used for training and evaluation on the 2017 COCO dataset [14]. While it is not an aerial-based dataset, this comparison allowed the novel approach to be evaluated for use in other applications. The detection results of our proposed architectures compared to those of existing methods can be seen in Section 3.
2.3. Inflated IoU Training
CR-CNN addresses the shortcomings of the RPN through the prediction of new region proposals using information distinct from that of the FPN. However, this method does not improve the classic RPN of [4] itself, which will still fail to recognize small objects, as documented in [34,35]. Addressing this problem through a new training strategy allows the standard RPN architecture to better detect regions containing smaller objects, in concert with the new branch of CR-CNN, while minimizing increases in computation and implementation costs.
A number of different IoU losses have been proposed, including GIoU [67], DIoU and CIoU [68]; however, regardless of the IoU-based loss utilized, smaller objects will generally be assigned positive labels at lower rates than larger objects. We trained CR-CNN using a modified training strategy that artificially inflates the IoU of object–anchor pairs for smaller objects within the dataset. The increased IoU allows more small objects to be assigned positive training labels by surpassing the 0.7 IoU threshold and increases the number of small objects utilized while training the RPN, which results in improved detection of similar smaller objects during evaluation. While data augmentation strategies such as [35,69,70] frame the problem of poor detection performance on small objects as a general lack of small objects during training, we define the problem as a lack of objects utilized during training. This approach avoids the shortcomings introduced by data augmentation methods, such as objects appearing out of context relative to the background or stark changes in illumination.
According to the $p_i^{*}$ term from (1), an object only contributes to the training of the RPN when an anchor is assigned a positive training label ($p_i^{*} = 1$). Every anchor is assigned a binary training label, with an anchor assigned a positive label in two scenarios: (i) the anchor has the highest IoU overlap with a groundtruth box or (ii) the anchor has an IoU overlap greater than 0.7 with any groundtruth box [36]. Our approach specifically targets rule (ii), which allows a groundtruth box to assign a positive training label to multiple anchors, and aims to increase the number of small objects within the dataset that surpass the 0.7 IoU threshold of CR-CNN.
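As a minimal sketch of how these two labeling rules translate to code (PyTorch is assumed; the function name, the 0.3 negative threshold and the IoU-matrix layout follow the conventions of [36] and are illustrative, not the exact CR-CNN implementation):

```python
import torch

def assign_rpn_labels(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Assign binary RPN training labels from an (anchors x objects) IoU matrix.

    Rule (ii): any anchor whose IoU with some groundtruth box exceeds pos_thresh
    is positive. Rule (i): the best anchor for each groundtruth box is positive
    regardless, so every object matches at least one anchor.
    """
    labels = torch.full((iou.shape[0],), -1, dtype=torch.long)  # -1 = ignored
    best_iou_per_anchor, _ = iou.max(dim=1)
    labels[best_iou_per_anchor < neg_thresh] = 0   # background anchors
    labels[best_iou_per_anchor >= pos_thresh] = 1  # rule (ii)
    labels[iou.argmax(dim=0)] = 1                  # rule (i), applied last
    return labels
```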
Figure 4 illustrates the relationship between the overlaid anchor proposals and the groundtruth box for a single object, along with the typically low IoU that results. By inflating the IoU of smaller objects with each anchor, we effectively increase the size of the object (enlarging the red annotation box) or decrease the size of the anchor (dashed green boxes) in order to produce IoUs greater than the threshold value.
Our proposed method simply multiplies the IoU of every anchor–groundtruth pair by a scaling value $s(x)$ determined by the Gaussian function

$$ s(x) = 1 + a\, e^{-\frac{(x - b)^2}{2c^2}}, \qquad (8) $$

where $a$ and $c$ control the magnitude of the scaling and the range of object sizes to which the IoU inflation is applied, while $b$ dictates the object size that receives the largest scaling factor, based on the object bounding box area $x$. Increasing $c$ allows more objects to have their IoU artificially increased, while decreasing $c$ focuses the scaling on a more specific object size. It was found that overall detection began to suffer when the $c$ values were too small or too large, impacting large-object and small-object detection, respectively. It was also found that we could influence the detection accuracy based on object size by shifting the center point of the Gaussian using the hyperparameter $b$. However, past a certain point, as the value of $b$ increased, the performance gains decreased, as would be expected. This work used fixed values for $a$, $b$ and $c$ but did not pursue a thorough optimization of these values beyond demonstrating improved detection results. The new inflated IoU is then calculated as follows:

$$ \mathrm{IoU}^{*} = s(x) \cdot \mathrm{IoU}, \qquad (9) $$

where $\mathrm{IoU}$ is the base IoU of each object–anchor pair, and $\mathrm{IoU}^{*}$ is the new IoU used to assign the binary training label $p_i^{*}$ during RPN training. In implementation, this process can be achieved with two efficient array calculations: first computing a tensor of the scaling values from a tensor of the object areas, then multiplying each original IoU by its related scaling factor. During training, there was no notable impact on the training time; however, this would be a valuable consideration in future work.
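A minimal sketch of these two array operations, assuming PyTorch (the hyperparameter defaults here are placeholders, not the paper's tuned $a$, $b$ and $c$):

```python
import torch

def gaussian_scale(areas, a=0.5, b=256.0, c=512.0):
    # Eq. (8): per-object scaling factor from the bounding box area x.
    # a, b and c are illustrative placeholders, not the paper's values.
    return 1.0 + a * torch.exp(-((areas - b) ** 2) / (2.0 * c ** 2))

def inflate_iou(iou, gt_areas, **gauss_kwargs):
    # Eq. (9): broadcast the per-object scales across the
    # (num_anchors, num_objects) base IoU matrix in one multiply.
    s = gaussian_scale(gt_areas, **gauss_kwargs)  # shape: (num_objects,)
    return iou * s.unsqueeze(0)                   # inflated IoU matrix
```

An inflated value exceeding 1.0 is harmless here, since the result is only compared against the 0.7 labeling threshold.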
The Gaussian function was found to better confine the IoU scaling to objects within a specific size range compared to step, square and sigmoid rules, as demonstrated in Figure 5. Each of the curves was chosen for its ability to isolate the IoU inflation to smaller objects. The detection accuracy of Mask R-CNN trained on the Aerial-Cars data with the various rules can be seen in Table 2; the results in the table represent the best performance found for each scaling rule. The Gaussian function is believed to outperform the other approaches because it scales the IoU of objects near the center of the size range by a larger value than objects farther from the center, allowing the inflation to be tailored more precisely to a given dataset.
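For comparison, the step and sigmoid rules of Figure 5 might take forms like the following; the exact parameterizations are our own guesses, as the paper reports only the relative performance of each rule in Table 2:

```python
import torch

def step_scale(areas, a=0.5, cutoff=1024.0):
    # Flat inflation for every object below a fixed area cutoff.
    return 1.0 + a * (areas < cutoff).float()

def sigmoid_scale(areas, a=0.5, b=1024.0, k=0.01):
    # Smooth roll-off: near-full inflation for small areas, none for large.
    return 1.0 + a * torch.sigmoid(-k * (areas - b))
```

Unlike these monotone rules, the Gaussian of (8) peaks at a chosen area $b$ and decays on both sides, which is what allows the inflation to be centered on a target size range.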
While the training strategy was used with CR-CNN to improve the detection of smaller objects by the classical RPN, the generality of this approach allows the method to be applied to any anchor-based detection architecture. To demonstrate this, the training strategy was easily adapted to the YOLOv5 [50] architecture, which follows the anchor labeling rules of [36], except that the overlap of the anchors and groundtruth objects is compared using a width ratio $r_w$ and a height ratio $r_h$, calculated as follows:

$$ r_w = \frac{w_{gt}}{w_a}, \qquad r_h = \frac{h_{gt}}{h_a}, $$

where the groundtruth annotation width $w_{gt}$ and height $h_{gt}$ are compared to the anchor box width $w_a$ and height $h_a$. The maximum among both ratios and their inverses is computed as $r_{\max} = \max\left(r_w, \frac{1}{r_w}, r_h, \frac{1}{r_h}\right)$, and any anchor–groundtruth pair whose $r_{\max}$ is less than the matching threshold (4.0 by default in YOLOv5) is assigned a positive training label. This allows a scaling value to still be calculated following (8) and easily applied following (9), where $r_{\max}$ is the originally calculated ratio value and $r_{\max}^{*}$ is the inflated ratio, which, again, is compared to the matching threshold.
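A sketch of how this adaptation might look (PyTorch assumed; `anchor_t = 4.0` is YOLOv5's default matching threshold, and the direction in which the scale is applied, relaxing the ratio test for small objects, is our reading of the method):

```python
import torch

def yolo_positive_mask(gt_wh, anchor_wh, gt_areas, anchor_t=4.0,
                       a=0.5, b=256.0, c=512.0):
    # gt_wh: (num_objects, 2) groundtruth widths/heights;
    # anchor_wh: (num_anchors, 2) anchor widths/heights.
    r = gt_wh.unsqueeze(0) / anchor_wh.unsqueeze(1)   # (A, O, 2) ratios r_w, r_h
    r_max = torch.max(r, 1.0 / r).max(dim=2).values   # max of ratios and inverses
    s = 1.0 + a * torch.exp(-((gt_areas - b) ** 2) / (2.0 * c ** 2))  # Eq. (8)
    # Eq. (9) analogue: scale the ratio so small objects pass the test more easily.
    return (r_max / s.unsqueeze(0)) < anchor_t        # (A, O) positive-label mask
```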
The general applicability of this approach is explored in Section 3, which compares CR-CNN, Mask R-CNN and YOLOv5 trained using the inflation strategy against the same models trained using a typical data augmentation strategy.
Throughout this work, CR-CNN was implemented using Detectron2 [71]. Both the Aerial-Cars and Vehicle-Detection datasets include only bounding box annotations of vehicles and not segmentation masks; therefore, within the scope of this work, Mask R-CNN can be considered an improved form of Faster R-CNN: the architecture predicts only bounding boxes, using five FPN scales and RoIAlign, without the segmentation prediction branch. Our Mask R-CNN implementation also included the additional bottom-up pathway and adaptive feature pooling from [32] and was likewise implemented using Detectron2. All models that incorporated an FPN backbone utilized a ResNeXt-101 model pre-trained on ImageNet. Each model was trained for the standard 300 epochs, except for DETR, which was trained for 500 epochs, as in [51]. Besides DETR, all other models were found to stabilize around the 300-epoch mark. All other parameters followed [4,50], and default anchor boxes were used.
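As an illustration of this setup, a minimal Detectron2 training sketch is shown below; the config file is the stock ResNeXt-101 FPN detection baseline from the model zoo, and the dataset names are hypothetical placeholders rather than the exact CR-CNN configuration:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Stock ResNeXt-101 FPN detection baseline (ImageNet pre-trained backbone).
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("aerial_cars_train",)   # hypothetical registered dataset
cfg.DATASETS.TEST = ("aerial_cars_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1           # vehicles only

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```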