A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection

Butler, Justin; Leung, Henry

doi:10.3390/rs16214065

Open AccessArticle

A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection

by

Justin Butler

^*

and

Henry Leung

Department of Electrical and Software Engineering, University of Calgary, Calgary, AB T2N 1N4, Canada

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(21), 4065; https://doi.org/10.3390/rs16214065

Submission received: 2 October 2024 / Revised: 26 October 2024 / Accepted: 27 October 2024 / Published: 31 October 2024

Download

Browse Figures

Versions Notes

Abstract

:

Object detection architectures struggle to detect small objects across applications including remote sensing and autonomous vehicles. Specifically, for unmanned aerial vehicles, poor detection of small objects directly limits this technology’s applicability. Objects both appear smaller than they are in large-scale images captured in aerial imagery and are represented by reduced information in high-altitude imagery. This paper presents a new architecture, CR-CNN, which predicts independent regions of interest from two unique prediction branches within the first stage of the network: a conventional R-CNN convolutional backbone and an hourglass backbone. Utilizing two independent sources within the first stage, our approach leads to an increase in successful predictions of regions that contain smaller objects. Anchor-based methods such as R-CNNs also utilize less than half the number of small objects compared to larger ones during training due to the poor intersection over union (IoU) scores between the generated anchors and the groundtruth—further reducing their performance on small objects. Therefore, we also propose artificially inflating the IoU of smaller objects during training using a simple, size-based Gaussian multiplier—leading to an increase in the quantity of small objects seen per training cycle based on an increase in the number of anchor–object pairs during training. This architecture and training strategy led to improved detection overall on two challenging aerial-based datasets heavily composed of small objects while predicting fewer false positives compared to Mask R-CNN. These results suggest that while new and unique architectures will continue to play a part in advancing the field of object detection, the training methodologies and strategies used will also play a valuable role.

Keywords:

object detection; convolutional neural network; Mask R-CNN; UAV

Graphical Abstract

1. Introduction

Recent work in the field of object detection has led to the ever-increasing performance of state-of-the art models. Common architectures continue to be improved upon to achieve a greater detection accuracy while attaining increased detection speeds [1,2,3] and the ability to predict segmentation masks [4]. However, modern architectures continue to detect small objects at rates significantly lower than those for larger objects [5,6,7].

Smaller objects pose multiple challenges to detection models. Represented by fewer pixels, they provide less information for both training and detection. By occupying a smaller area of an image, small objects also require models to predict their locations with a higher level of precision. Despite these challenges, the detection of small objects is an important aspect of many applications, including remote sensing [8,9,10] and pedestrian and sign detection for autonomous vehicles [11,12,13].

A prime example of an application that primarily involves the detection of smaller objects is found in unmanned aerial vehicle (UAV) imagery. Often, the objects in these applications both meet the classical definition of small objects—smaller than

32 \times 32

pixels, as defined in COCO [14]—and are relatively smaller compared to the size of the images commonly captured in these applications. The capability of a UAV to cover vast amounts of terrain while carrying high-definition imaging equipment has allowed for this technology to be adapted to applications such as wildlife tracking [15], infrastructure monitoring [16,17] and search and rescue [18]. However, the benefits UAVs provide in object detection applications are undercut if the detection architecture fails to detect smaller objects.

Numerous methodologies have been explored for improving the detection of these challenging objects, including multi-scale learning, data augmentation, generative adversarial networks, unique training strategies, context-based learning and super-resolution approaches [6,19,20,21,22,23]. The work of [24] proposed the use of a super-resolution generative adversarial network (GAN) to increase the resolution of images before detection to discern smaller objects better. Numerous forms of improvement to the Feature Pyramid Network (FPN) backbone have been proposed to increase the extraction of information related to smaller objects [25,26,27,28,29]. While the work of [30,31] proposed new training strategies for improving the learning of smaller objects on the general COCO dataset.

In this paper, we aim to improve the performance of modern architectures in the detection of small objects—demonstrated through the detection of vehicles in challenging UAV-based imagery. We explore two means of improving detection: a new architecture composed of two parallel and unique backbone networks and a novel training methodology to improve the number of small objects seen during training.

We first propose an architecture called Center R-CNN (CR-CNN) to demonstrate that the Region Proposal Network (RPN) of Mask R-CNN [4] can benefit from additional information captured from a second unique branch. This parallel branch—which predicts the center of an object’s mass—utilizes information different from that in the FPN backbone of Mask R-CNN and allows for additional unique regions of interest (RoIs) to be predicted that capture small objects better.

Secondly, to improve the learned detection of small objects, the RPNs are trained using a new strategy that inflates the intersection over union (IoU) between each object’s anchor–groundtruth pairs. This bolsters the number of small objects utilized during training and is detailed in Section 2.3.

Finally, the performance of CR-CNN is evaluated on synthetic data first and then on two UAV-based datasets to demonstrate the applicability of the two strategies to challenging real-world applications. This is followed by a discussion of the results and future directions of research in Section 4.

Small Object Detection

Generally, R-CNN architectures outperform single-stage methods in detecting small objects [5]. On the MS-COCO dataset, Mask R-CNN achieves higher AP_S scores compared to those of competitive Faster R-CNN models and single-stage models such as YOLOv3 [1]. Improved architectures that are built upon Mask R-CNN [32] have further outperformed architectures such as YOLOv4 [2] and RetinaNet [33] in terms of AP, AP₅₀ and AP_S.

However, shortcomings of the RPN have been acknowledged. The work of [34] demonstrated that the region proposals from all five RPN layers of Faster R-CNN achieve lower average precision scores as the object size decreases. As documented by [35], a driving factor behind the under-performance of modern object detection architectures on small objects is their common reliance on anchor-based regression. While training the Mask R-CNN RPN on the COCO dataset, large objects match an average of 2.54 anchors, while small objects only match an average of 1 anchor per object—the absolute minimum possible—according to the object size definitions of [14]. This suggests that smaller objects are under-represented during training, which results in the classic RPN failing to recognize smaller objects and thereby proposing fewer regions that contain small objects.

Detection limitations are further compounded in aerial-based imagery, where objects are generally smaller relative to the size of images and located within complex environments. For example, small objects in the Vehicle-Detection dataset comprise 80% of all objects, as seen in Table 1, yet occupy 0.61% of the total image area in the dataset. The Aerial-Cars dataset follows a similar pattern, with small and medium objects totaling over 90% of all objects yet occupying only 2.66% of the total pixel space. According to occupied area alone, small objects are already at a disadvantage in remote sensing data.

The lack of pixels that a small object occupies creates extra challenges when training anchor-based methods. Within the training pipeline, each of the thousands of generated anchors are utilized as a positive training example if overlapping with a groundtruth object by a certain IoU threshold [1,36]. However, large objects will produce multiple anchors that exceed the threshold and will therefore be seen multiple times during training—creating a large object bias.

When trained on COCO, YOLOv5 was seen to match an average of 1.0 anchor per object less than

32^{2}

pixels while matching an average of 2.45 anchors per object larger than

96^{2}

pixels. On Aerial-Cars, YOLOv5 matched an equally poor number of 1.05 anchors per small object and 2.45 anchors per large object. Mask R-CNN also faced similar challenges, matching an average of 1.0 anchor per small object on COCO. This indicates that smaller objects are utilized less often during training, creating a large object bias in anchor-based methods. An approach to bolstering the number of small objects utilized during training would likely improve their detection.

Numerous approaches to improving the performance of detection architectures on aerial-based data have been trialed. The work of [37] explored the use of a super-resolution generative adversarial network to increase the resolution of objects found within the VisDrone dataset [38] before detection by an FPN. Terrail and Jurie [39] utilized hard-mining on aerial-based images to achieve an eight-point increase in the AP compared to that of support vector machine detectors on the VeDAI dataset. Alternatively, the work of [40,41] proposed how tiling of an image could improve the detection of smaller objects. Anchor-less and orientated detectors have also received a lot of attention as a successful way to improve the detection of small objects in remote sensing imagery. This includes the General Gaussian Heatmap Label Assignment (GGHL) [42], the Appearance Sensitivity Detector (ASDet) and its unique feature extractor and dual appearance-aware loss [43] and HRDet with its approach to negating noisy and complex backgrounds and its novel regression function EOIoU [44].

Further approaches include data augmentation and a modified RPN, as proposed by [45], to pool feature maps to improve the detection of small objects in the Munich dataset. The work of [46] divides aerial-based data in the UAVDT dataset into specific domains based on imaging altitude, weather and other environmental characteristics for an increase in the AP of 2.27 points for Faster R-CNN. Additionally, the work of [47] aims to improve the performance of RoIs by incorporating ground sample distance into the detection pipeline. Finally, the work of [48] utilizes frequency-domain-based information to further extract valuable and rare information from aerial-based imagery. However, despite these and various other attempts to improve the detection of objects in UAV-based data, modern detectors continue to struggle with this challenge.

The work of [38] demonstrated that R-CNN-based detectors generally achieve higher detection accuracies on UAV-based data compared to other architectures; Ref. [49] documented the higher performance of Faster R-CNN compared to YOLOv3 and YOLOv4 on the Stanford Drone Dataset. Furthermore, Mask R-CNN demonstrates the superior performance of the R-CNN architecture on smaller objects compared to YOLOv5 [50] and the Detection Transformer (DETR) [51] on a supplemented Aerial-Cars dataset—exceeding both architectures by up to 17.3 points in AP_S. However, despite outperforming other architectures, the R-CNN architecture still fails to meet acceptable performance levels on UAV-based data. Faster R-CNN demonstrated significant under-performance on the low-altitude VisDrone dataset [52] and the Stanford Drone Dataset [49,53]. These results suggest that the R-CNN architecture can be improved upon to meet the challenges posed by small objects such as those in UAV-based imagery better.

The following work will detail the proposed new architecture and a unique training strategy that increases the number of small objects seen during training. This is followed by our results demonstrating the success of CR-CNN in detecting small objects in aerial imagery and the better results of common detection architectures when they are trained using our inflated IoU training strategy.

2. Materials and Methods

2.1. Datasets

CR-CNN was trained and evaluated on datasets comprising both real-world and synthetic data. We created the synthetic dataset as a simple and standard dataset with minimal noise and other controlled variables in order to form a baseline for the proposed methodologies. Exploratory evaluation on the synthetic data determined whether the expected outcome of both strategies was met before their application to more complex and challenging real-world data. The synthetic data consist of multiple colored shapes on different colored backgrounds, with the inclusion of some challenging and obscure objects, such as overlapping shapes. The synthetic training dataset comprises 1500 images totaling 33,520 objects—24% of which are smaller than

32^{2}

pixels and 69.58% of which are smaller than

96^{2}

pixels. The evaluation dataset comprises 150 images totaling 6786 objects, with 1651 shapes smaller than

32^{2}

pixels.

For real-world aerial data, we trained and evaluated our proposed architecture on two datasets—the Aerial-Cars dataset [54] and the Vehicle-Detection dataset [55]—chosen for their challenging characteristics compared to other common UAV-based datasets [38,53,56]. Both include multiple camera angles, altitudes and complex backgrounds, which accurately capture the challenging nature of aerial-based data.

The Aerial-Cars dataset was further annotated with an additional 546 vehicles—primarily located closer to the horizon or partially obstructed—to provide a more realistic environment for stress-testing the detection architectures. The Aerial-Cars training dataset contains 3174 objects—of which 539 are smaller than

32^{2}

pixels—within 155 images, with an average image size of

[1777, 987]

. The evaluation dataset consisted of 33 images, an example of which can be seen in Figure 1.

The Vehicle-Detection training dataset (sub-datasets 1, 2 and 3) comprised 1495 images that contained 19,348 objects—only car annotations were used for training and evaluation. Of these 19,348 objects, 80% were smaller than

32^{2}

pixels. The evaluation dataset (subset 4) comprised 190 images containing a total of 3647 objects—3325 of which were smaller than the COCO-defined

32 \times 32

pixel definition. The distribution of object size within the Vehicle-Detection dataset, with a comparison to the synthetic and Aerial-Cars data, can be seen in Figure 2.

Finally, the proposed inflated IoU training strategy was used to train and evaluated on the 2017 COCO dataset [14]. While it is not an aerial-based dataset, this comparison allowed for this novel approach to be evaluated for use in other applications. The detection results of our proposed architectures compared to those of existing methods can be seen in Section 3.

2.2. Architecture

2.2.1. Mask R-CNN

The base Mask R-CNN architecture is divided into two stages, with the first stage utilizing an FPN and an RPN to predict RoIs in the form of rectangular object proposals—each with an objectness score [4]. This first stage can be seen within the left-most red-hatched box in Figure 3. To generate these proposals, a small convolutional layer is slid across the last shared layer of the FPN backbone before two parallel

1 \times 1

convolutions are applied for box regression and box classification. Each sliding window location predicts k proposals with their coordinates predicted relative to 15 differently sized anchor boxes [4]. This first-stage architecture is heavily based off the work of [36].

Following the prediction of the RoI proposals, the second stage of Mask R-CNN performs classification, localization and segmentation. From each RoI, a small feature map is extracted by RoIAlign—allowing for each floating-number RoI to accurately represent a region from the discrete feature maps of the FPN [4]. Having extracted a small

7 \times 7

feature map from each RoI, the network head extends the work of [36] and [57] and introduces a third parallel convolution branch to predict the segmentation masks—a feature not utilized by this work due to the lack of segmented aerial-based data. This second stage, including the classification and box regression prediction branches, can be seen in the right-most red-hatched box in Figure 3.

The performance of Mask R-CNN is heavily dependent on the performance of the RPN—object locations missed during the region proposal stage will never be detected by the prediction heads. The FPN-based RPN has demonstrated successful results in detecting medium and large objects that should be retained while aiming to improve the performance of Mask R-CNN on small objects. A second parallel RPN branch can instead supplement the region proposals by predicting RoIs from convolutional information that is distinct from that from the FPN backbone. This architecture can be seen in Figure 3.

The R-CNN approach [58] sequentially predicts the region proposals before predicting objects in a two-stage fashion. Faster R-CNN [36] improved upon earlier methodologies by adopting the attention mechanism of [59] in the RPN. Mask R-CNN [4] further built upon the results of [36] through the adoption of an FPN backbone [57] and a parallel third prediction branch for the successful prediction of segmentation masks. The R-CNN architecture can also be easily improved upon: the PANet paper [32] demonstrated improved performance through using an additional bottom-up backbone pathway and the fusion of FPN feature maps.

The RPN of [36] predicts a set of region proposal bounding boxes with equivalent objectness scores from a given image. A small convolutional network acts in a sliding-window fashion across each hierarchical layer of the backbone—five layers in the case of Mask R-CNN. Each sliding-window location predicts multiple proposals as offsets of fixed-size anchors with three aspect ratios and N scales. Mask R-CNN uses

N = 5

in accordance with the five FPN scales and therefore utilizes 15 anchors at each sliding-window location. During training, the RPN is only trained on objects whose groundtruth annotations, with coordinates

t_{i}^{*}

, overlap with an anchor by more than 0.7 following (1):

\begin{matrix} L (p_{i}, t_{i}) & = \frac{1}{N_{c l s}} \sum_{i} L_{c l s} (p_{i}, p_{i}^{*}) \\ + \frac{λ}{N_{r e g}} \sum_{i} p_{i}^{*} L_{r e g} (t_{i}, t_{i}^{*}) \end{matrix}

(1)

where anchor–object pairs with an IoU greater than 0.7 are assigned a positive groundtruth label

p_{i}^{*} = 1

and pairs with an IoU of less than 0.3 are assigned a negative label

p_{i}^{*} = 0

following [36]. The label

p_{i}^{*}

ensures that the regression loss

L_{r e g}

is only computed for positive training labels, meaning the RPN learns to better propose regions only when objects have significant IoU overlap with any anchor. Following [36], the classification loss

L_{c l s}

is a binary log loss between the predicted objectness scores

p_{i}

and the groundtruth label

p_{i}^{*}

.

L_{r e g}

is the robust loss function of [60] between the vector

t_{i}

representing the four box coordinates of the predicted region proposal and the vector

t_{i}^{*}

of the coordinates of the respective groundtruth object according to index i. The two terms are normalized by the constants

N_{c l s}

and

N_{r e g}

and are equally weighted by

λ

[36].

2.2.2. CR-CNN

The work of [61,62,63] introduced object detection methods utilizing keypoints predicted from objectness heatmaps rather than bounding box anchors; all three approaches first generate heatmap feature maps using a stacked hourglass network [64]. The secondary RPN branch of CR-CNN treats the prediction of region proposals similar to these keypoint-based works—by first extracting objectness heatmaps of an image before predicting the location of the RoIs. In this manner, the center of a single predicted RoI can be treated as a central keypoint, predicted from a heatmap solely focused on the existence of an object without focusing on the object’s finer details.

CR-CNN downsamples the input image once by a stride of two, and following the work of previous keypoint-based approaches, the objectness heatmap is extracted using two consecutive stacked hourglass modules, with each module first reducing and subsequently upsampling the image five times. For an input image

I \in R^{W \times H \times 3}

, the stacked hourglass network produces a heatmap

\hat{Y} \in {[0, 255]}^{\frac{W}{R} \times \frac{H}{R} \times C}

where

R = 4

is the stride of the output heatmap, following other heatmap-based detectors [61,62,65,66]. The heatmap strictly predicts the existence of an object and does not take an object’s classification into consideration; therefore, the CR-CNN heatmap has an output dimension of

C = 1

. This is implemented within the network as a

1 \times 1

convolution following the stacked hourglass network with an output dimension of one. Regions of the heatmap that have values near 255 are considered to be regions likely to contain an object.

During training, the stacked hourglass network is trained using a penalty-reduced pixel-wise logistic regression loss function with focal loss modified for a single category

C = 1

[61,63] as follows:

L_{h e a t} = \frac{- 1}{N} \sum_{x y} \{\begin{matrix} {(1 - {\hat{Y}}_{x y})}^{α} log ({\hat{Y}}_{x y}), & if Y_{x y} = 255 \\ \begin{matrix} (1 - Y_{x y})^{β} {({\hat{Y}}_{x y})}^{α} \\ log (1 - {\hat{Y}}_{x y}) \end{matrix} & otherwise, \end{matrix}

(2)

for N keypoints within the image I where

\hat{Y}

is the predicted heatmap and Y is the groundtruth heatmap—both of resolution

x \times y

. The focal loss values

α = 2

and

β = 4

follow those of [33].

Each image’s equivalent groundtruth heatmap

Y \in {[0, 255]}^{\frac{W}{R} \times \frac{H}{R} \times 1}

is composed of objects splat onto the heatmap using a Gaussian kernel as follows:

Y_{x y} = exp (\frac{- {(x - {\tilde{p}}_{x})}^{2} + {(y - {\tilde{p}}_{y})}^{2}}{2 σ_{p}^{2}})

(3)

with low-resolution object center keypoints

{\tilde{p}}_{x} = ⌊ \frac{x}{R} ⌋

and

{\tilde{p}}_{y} = ⌊ \frac{y}{R} ⌋

at an

{x, y}

coordinate of the heatmap, where

σ_{p}

is the object-size-dependent standard deviation calculated to be

\frac{1}{3}

the size of the object’s width or height, whichever is greater [61]. Following [65], the element-wise maximum of their kernels is taken when two objects overlap.

Two RoI prediction heads are considered in order to reinforce that the additional RPN supplementation is more influential than the specific RoI prediction head architecture itself. A prediction head closely mirroring the work of [63] treats RoI predictions as a regression problem—predicting RoI centers, offsets and sizes—and is referred to as CR-CNN-off throughout this work. A second simplified head more closely follows the RPN structure of [36] but instead predicts RoIs in absolute coordinates and is referred to as CR-CNN-abs.

CR-CNN-off: Inspired by [63], the RoI centerpoint, offset and size values are predicted from the stacked hourglass heatmap. A local offset value

L_{o f f}

is predicted to offset the discretization error introduced by the output stride R in the calculation of

{\tilde{p}}_{x}

and

{\tilde{p}}_{y}

. The offset predictions are trained using an L1 loss as follows:

L_{o f f} = \frac{1}{N} \sum_{p} | {\hat{O}}_{\tilde{p}} - (\frac{p}{R} - \tilde{p}) |

(4)

for the predicted local offset values

\hat{O}

from each object centerpoint. The

(\frac{p}{R} - \tilde{p})

term ensures the loss term is only applied when an object exists within the groundtruth heatmap Y. The offset values are predicted in the form

{\hat{O}}_{{\hat{x}}_{k}, {\hat{y}}_{k}} = (δ {\hat{x}}_{k}, δ {\hat{y}}_{k})

for object k.

The RoI size predictions are trained following the L1 loss:

L_{s i z e} = \frac{1}{N} \sum_{k = 1}^{N} | {\hat{S}}_{p k} - s_{k} |

(5)

where the object size

{\hat{S}}_{p k}

is the predicted width and height—the dimensions of the region proposals in the case of CR-CNN—measured against the groundtruth object size

s_{k} = (x_{2}^{(k)} - x_{1}^{(k)}, y_{2}^{(k)} - y_{1}^{(k)})

for object k of N objects.

Following [63], both

L_{o f f}

and

L_{s i z e}

are predicted alongside the centerpoint coordinates from the output of the stacked hourglass network utilizing identical

3 \times 3

convolutions, followed by a ReLU layer and additional

1 \times 1

convolution. These operations are performed by the independent RPN within the blue-hatched border in Figure 3.

During training, the supplemental RPN branch is trained as follows:

L_{r p n} = L_{h e a t} + λ_{s i z e} L_{s i z e} + λ_{o f f} L_{o f f}

(6)

where the loss function comprises the heatmap loss

L_{h e a t}

defined in (2), the size loss

L_{s i z e}

and the offset loss

L_{o f f}

. The size and offset loss values are scaled by the values

λ_{s i z e} = 0.1

and

λ_{o f f} = 1

following [63].

CR-CNN-abs: The supplemental RoIs are again predicted from an objectness heatmap

\hat{Y}

—extracted with an identical stacked hourglass network to that in CR-CNN-off. The RoI prediction heads closely follow the RPN architecture of [36] for simplicity—two parallel fully connected layers predicting the bounding box coordinates and objectness scores from

\hat{Y}

. The supplemental RPN branch is trained as follows:

\begin{matrix} L_{r p n} = L_{h e a t} + \frac{1}{N_{c l s}} \sum_{i} L_{c l s} (p_{i}, p_{i}^{*}) \\ + \frac{λ}{N_{l o c}} \sum_{i} p_{i}^{*} L_{l o c} (v_{i}, v_{i}^{*}) \end{matrix}

(7)

where the localization loss

L_{l o c}

is the robust loss function of [60] between the predicted RoI of coordinates

v_{i}

and the groundtruth objects with absolute coordinates

v_{i}^{*} = {v_{x}^{*}, v_{y}^{*}, v_{w}^{*}, v_{h}^{*}}

. The use of absolute coordinates avoids the use of anchors by the supplemental branch. Following [36], the classification loss

L_{c l s}

is a binary log loss between the predicted objectness score

p_{i}

and the binary training label

p_{i}^{*}

assigned to each RoI based on the IoU overlap between each RoI–groundtruth pair. The heatmap loss

L_{h e a t}

is again calculated following (2).

2.3. Inflated IoU Training

CR-CNN addresses the shortcomings of the RPN through the prediction of new region proposals using information distinct from that of the FPN. However, this method fails to improve upon the classic RPN in [4], which will still fail to recognize small objects, as documented by [34,35]. Addressing this problem through a new training strategy allows for the standard RPN architecture to detect regions containing smaller objects better in unity with the new branch of CR-CNN while minimizing increases in the computation or implementation costs.

A number of different IoU losses have been proposed, including GIoU [67], DIoU and CIoU [68]; however, regardless of the IoU-based loss utilized, smaller objects will generally be assigned positive labels at lower rates than larger objects. We trained CR-CNN using a modified training strategy that artificially inflates the IoU of object–anchor pairs for smaller objects within the dataset. The increased IoU allows for more small objects to be assigned positive training labels by surpassing the 0.7 IoU threshold and increases the number of small objects utilized while training the RPN. This achievement results in improved detection of similar smaller objects during evaluation. While data augmentation strategies such as [35,69,70] frame the problem of poor detection performance on small objects as a general lack of small objects during training, we define the problem as a lack of objects utilized during training. This approach avoids the shortcomings introduced by data augmentation methods such as the existence of objects out of context relative to the background or stark changes in illumination.

According to the

p_{i}^{*} L_{r e g} (t_{i}, t_{i}^{*})

term from (1), an object only contributes to the training of the RPN when an anchor is assigned a positive training label

p_{i}^{*} = 1

. Every anchor is assigned a binary training label, with an anchor assigned a positive label in two scenarios: (i) the anchor has the highest IoU overlap with a groundtruth box or (ii) the anchor has an IoU overlap greater than 0.7 with any groundtruth box [36]. Our approach specifically targets rule (ii)—which allows a groundtruth box to assign a positive training label to multiple anchors—and aims to increase the number of small objects within the dataset that surpass the 0.7 IoU threshold of CR-CNN. Figure 4 illustrates the relation between the overlaid anchor proposals and the groundtruth box for a single object and the typically low IoU as a result. By inflating the IoU of smaller objects with each anchor, we effectively increase the size of the object (enlarging the red annotation box) or decrease the size of the anchor (dashed green boxes) in order to produce IoUs greater than the threshold value.

Our proposed method simply multiplies the IoU of every anchor–groundtruth pair using a scaling value

S_{I o U}

determined by the Gaussian function

S_{I o U} = a * exp (- \frac{{(x - b)}^{2}}{2 σ^{2}})

(8)

where a and

σ

control the scaling magnitude and range of object size that the IoU inflation is applied to, while b dictates the object size that the largest scaling factor is applied to based on the object bounding box area x. Increases in

σ

will allow for more objects to have their IoU artificially increased, while decreasing

σ

will allow for the scaling to be focused on a more specific size of object. It was found that the overall detection began to suffer when the

σ

values were too small or too large, impacting large object and small object detection, respectively. It was also found that that we could influence the detection accuracy based on object size by shifting the center point of the Gaussian using hyperparameter b. However, at a certain point, as the value of b increased, the performance gains decreased, as would be expected. This work used the values

a = 0.3

,

b = 1000

and

σ = 750

but did not pursue a thorough optimization of these values beyond demonstrating improved detection results. The new inflated IoU is then calculated as follows:

I o U_{i n f l a t e d} = I o U_{o l d} * (1 + S_{I o U})

(9)

where

I o U_{o l d}

is the base IoU of each object–anchor pair, and

I o U_{i n f l a t e d}

is the new IoU used to assign the binary training label

p_{i}^{*}

during RPN training. In implementation, this process can be achieved with two efficient array calculations—calculating a Tensor of the scaling values from a Tensor of the object areas first before then multiplying each optimal IoU by its related scaling factor. During training, there was no notable impact on the training time; however, this would be a valuable consideration in future work.

The use of the Gaussian function was found to allow for the IoU scaling to be applied to objects within a specific size range better compared to step, square and sigmoid rules, as demonstrated in Figure 5. Each of the different curves was chosen for its ability to isolate the IoU inflation to smaller objects. The detection accuracy of Mask R-CNN on the Aerial-Cars data trained with the various rules can be seen in Table 2. The results in the table represent the best performance found for each of the scaling rules. The Gaussian function is believed to outperform the other approaches due to the function allowing for objects that fall within the center of the size range to have their IoU scaled by a larger value than objects farther from the center, which permits the inflation to be applied to specific datasets better.

While the training strategy was used with CR-CNN to improve the detection of smaller objects by the classical RPN, the generality of this approach allows for the method to be applied to any anchor-based detection architecture. To demonstrate this, the training strategy was easily adapted to the YOLOv5 [50] architecture, which follows the anchor labeling rules of [36], but the overlap of the anchors and groundtruth objects was compared using a width ratio

r_{w}

and a height ratio

r_{h}

calculated as follows:

r_{w} = \frac{w_{g t}}{w_{a n c h}}, r_{h} = \frac{h_{g t}}{h_{a n c h}}

(10)

r_{w}^{m a x} = max (r_{w}, \frac{1}{r_{w}}), r_{h}^{m a x} = max (\frac{1}{r_{h}})

(11)

\begin{matrix} r^{m a x} = max (r_{w}^{m a x}, r_{h}^{m a x}) \end{matrix}

(12)

where the groundtruth annotation width

w_{g t}

and height

h_{g t}

are compared to the anchor box width

w_{a n c h}

and height

h_{a n c h}

. The maximum among both ratios and their inverse is computed as

r^{m a x}

, and any anchor–groundtruth pair whose

r^{m a x}

is less than

4.0

is assigned a positive training label. This allows for a scaling value to still be calculated following (8) and easily applied following (9) where

I o U_{o l d}

is the originally calculated

r^{m a x}

value and

I o U_{i n f l a t e d}

is the inflated

r^{m a x}

, which, again, is compared to the

4.0

threshold. The general applicability of this approach is explored in Section 3 with a comparison of CR-CNN, Mask R-CNN and YOLOv5 as trained using the inflation strategy compared to using a typical data augmentation strategy.

Throughout this work, CR-CNN was implemented using Detectron2 [71]. Both the Aerial-Cars and Vehicle-Detection datasets include only bounding box annotations of vehicles and not segmentation masks; therefore, within the scope of this work, Mask R-CNN can be considered an improved form of Faster R-CNN. The architecture only predicts bounding boxes using five FPN scales and RoIAlign without the segmentation prediction branch. Our Mask R-CNN implementation also included the additional bottom-up pathway and the adaptive feature pooling from [32] and was implemented using Detectron2. All models that incorporated an FPN backbone utilized a ResNeXt-101 model pre-trained on ImageNet. Each model was trained for the standard 300 epochs except for DETR, which was trained using 500 epochs, as in [51]. Besides DETR, all the other models were found to stabilize around the 300-epoch mark. All the other parameters followed the work of [4,50], and default anchor boxes were used.

3. Results

3.1. Architecture-Derived Improvements

CR-CNN is intended to improve the performance of modern object detection architectures on small objects, such as the challenge of detection in UAV-based imagery. The images within the COCO dataset have average dimensions of

640 \times 480

pixels and an average area of

307, 200

square pixels. Indeed, [14] defines a small object as one with an area less than 1024 square pixels—equivalent to an area

1 / 300

th of the size of a full image. Alternatively, the UAV-based datasets have average dimensions of

1777 \times 987

pixels and an average area of over 1.7 million square pixels. For an object to occupy the equivalent of

1 / 300

th of a UAV-based image, it would need an area of approximately 5800 square pixels or an average area of

76 \times 76

. As shown in Figure 2, the majority of objects across both aerial datasets are less than 6000 square pixels. This suggests the AP₅₀ and AP metrics of the models evaluated are better measures of the architectures’ ability to detect small objects than the MS-COCO AP_S score on these aerial datasets.

The CR-CNN architecture shows consistent improvement over Mask R-CNN, YOLOv5 and DETR, as seen in Table 3 and Table 4. On the synthetic data in Table 3, CR-CNN outperforms Mask R-CNN in all metrics, including more than a 6-point increase in AP_S, which demonstrates improved detection of objects smaller than 1024 square pixels—the majority of the objects in the synthetic data. CR-CNN also demonstrates an improved ability to accurately localize objects compared to YOLOv5 based on its higher AP score—a particularly challenging task when detecting primarily smaller objects.

On the two UAV-based datasets seen in Table 4 and Table 5, CR-CNN outperforms Mask R-CNN, YOLOv5 and DETR in almost every metric Both the absolute and regression prediction heads outperform the other architectures in AP₅₀, demonstrating better detection accuracy. Similarly, the higher AP metrics achieved by the two models show that CR-CNN is not only better at detecting small objects but also at accurately localizing them within large images (Figure 6 and Figure 7).

3.2. Inflation-Derived Improvements

Similar to the evaluation of CR-CNN, the inflated IoU strategy was evaluated on synthetic data, COCO and the same aerial-based datasets, with the results compared to standard Mask R-CNN and YOLOv5, along with the data augmentation strategy of [35] as seen in Table 6.

Evaluating the inflated IoU approach on the COCO dataset showed its impact on detecting objects of more varied sizes and in a different application. These results can be seen in Table 7, and while segmentation was not a focus of this work, for completeness, the inflated IoU method was evaluated on the segmentation predictions of Mask R-CNN, which can be seen in Table 8.

The evaluation of our proposed training strategy against the copy-and-paste augmentation from [35] was intended to examine two very different training approaches. The augmentation strategy produces varying results based on the architecture and dataset in question, and this is believed to be due to the variable impact on training that each copied object produces. Objects and their anchors that do not usually meet the IoU threshold to qualify as a valid training example are likely to produce copies that also fail to meet the threshold—but these copied objects then increase the clutter and complexity of the image, creating an ever more challenging detection environment. The impact of this will depend on both the dataset in question and how the prediction architecture handles complex and busy images.

While the Gaussian curve was found to improve the detection best—and was therefore the inflation rule evaluated within this section—each of the inflation rules may prove advantageous in different scenarios. The step rule would allow for a precise maximum object size to which inflation is applied to while equally inflating the IoU of all objects below this size. For datasets with very polarized object sizes and detection results, where large objects share a common size and excellent detection rate and small objects share a separate size and poor detection rate, then the step rule might be more suitable. Depending on the distribution of size, a square or sigmoid inflation rule might also be appropriate in that situation.

4. Discussion

Overall, the results of Table 4 and Table 5 support the conclusion that an objectness heatmap can successfully improve the RoI proposals of the Mask R-CNN architecture. The extraction of these additional RoIs by CR-CNN directly leads to a higher object detection accuracy in aerial imagery and fewer incorrect predictions compared to the base Mask R-CNN implementation. Additionally, CR-CNN demonstrated improved performance compared to YOLOv5 and DETR on both aerial datasets.

The advantage of CR-CNN is that object proposal prediction—the foundation of all correct or incorrect predictions—occurs from two unique approaches. While it is inevitable that the two branches will predict a large number of identical proposals, each will also produce a number of unique proposals—creating a better foundation for detecting objects that are usually missed. Additionally, the keypoint-based branch is tasked with the simplest form of detection—simply predicting whether an object does or does not exist—while providing valuable information to the second stage of the network. Meanwhile, approaches such as GANs and super-resolution approaches perform complex object prediction on enhanced, or already predicted, images—compounding incorrect predictions.

Comparison of the performance of equivalent CR-CNN architectures when trained with and without the inflated training, as in Table 9, reveals that the training strategy plays only a small role in improving the accurate detection of smaller objects. On the synthetic data, the training strategy leads to increases in AP_S of up to 1.2 points. On the UAV datasets, the strategy increases the AP₅₀ score by up to 1.1 points. This suggests that the additional RoI predictions from the second RPN branch are the primary drivers of improving the detection of smaller objects.

The improved detection of small objects by CR-CNN can be understood better from the composition of the architecture’s predictions on the UAV data, as shown in Table 5. The architecture introduced achieves two improvements: falsely predicting the existence of objects at a lower rate compared to Mask R-CNN—represented as fewer false positives (FPs)—and predicting the correct existence of more objects compared to Mask R-CNN, as shown by the higher number of true positive predictions (TPs). This supports the idea that the additional branch produces high-quality proposals and reduces the number of poor proposals from the original Mask R-CNN’s first stage that are utilized in the final predictions.

While CR-CNN was trained using the IoU inflation strategy in order to improve the performance of the classical RPN on smaller objects, this strategy could easily be adapted to any anchor-based architecture, such as an R-CNN or YOLO architecture. An inflated Mask R-CNN and YOLOv5 achieve an improved detection accuracy over a baseline model on simple synthetic data, as shown in Table 6, and on more complex aerial data, as shown in Table 9. On the COCO dataset, Mask R-CNN assigns an average of 1.00 anchors to each small object but assigns an average of 1.12 anchors per object when trained using our inflation strategy. YOLOv5 demonstrates similar improvements on the Aerial-Cars data by increasing the average number of small object–anchor pairs labeled as positive examples from 1.00 to 1.05. These results are summarized in Figure 8. The inflation strategy can also achieve higher average precision scores compared to data augmentation strategies such as [35]. Despite the large differences between R-CNN and YOLO detectors, the inflated training strategy produces an improved performance for both architectures. These results confirm that increasing the number of objects seen during training will lead to improved detection accuracy.

The inflated IoU approach also increased the detection of medium and large objects on both aerial datasets and increased the accurate prediction of medium but not larger objects on the COCO dataset, as shown in Table 7. This suggests that there is an upper limit on the object size for which the inflated IoU training produces improved results. While the aerial datasets have larger objects, their size is still often dwarfed by some of the largest objects within COCO. This would be an important consideration for utilizing the inflation strategy in different applications.

UAV-based imagery is a challenging environment for state-of-the-art detectors due to its high proportion of small objects, which are often dwarfed by the background and comprise only small portions of an image. We proposed an R-CNN architecture that predicted additional RoIs from an objectness heatmap which could extract the existence of small objects better. This two-branch RPN was then trained using a strategy that increased the visibility of smaller objects to improve the detection of small objects by an R-CNN architecture further. With the ever-increasing accessibility of UAVs, this work evaluated the improved performance of CR-CNN on aerial-based data as a demonstration of the real-world applicability of this architecture.

The success of the CR-CNN method suggests that the classical RPN of R-CNN architectures could be supplemented in any number of ways with further research. While CR-CNN demonstrated improved performance, it was almost certainly at the cost of increased computational complexity. While this was not a focus of this work, it would be a valuable direction of future research to explore simpler algorithms, including real-time approaches, inspired by this dual-proposal approach. Additionally, CR-CNN was tested only on aerial-based datasets, and it would be beneficial to further benchmark the proposed method on other real-world data, such as CCTV footage, to explore the general applicability of methods utilizing dual first-stage prediction branches. Additionally, IoU inflation is only one method for increasing the visibility of under-represented objects while training detection models and further work in this area would be beneficial. We did not pursue rigorous optimization of the parameters for this approach either, and the methodology could likely be optimized. However, this approach highlights that all components of the object detection pipeline should remain important aspects of future research.

Author Contributions

The contribution of the two authors to this work was as follows: conceptualization, J.B. and H.L.; methodology, J.B.; software, J.B.; validation, J.B.; formal analysis, J.B.; investigation, J.B.; resources, J.B.; data curation, J.B.; writing—original draft preparation, J.B.; writing—review and editing, J.B. and H.L.; visualization, J.B.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Alberta Major Innovation Fund, the Natural Sciences and Engineering Research Council (NSERC) and Alberta Student Aid.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The funders had no role in the design of this study; in the collection, analyses or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

References

Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
Tong, K.; Wu, Y. Deep learning-based detection from the perspective of small or tiny objects: A survey. Image Vis. Comput. 2022, 123, 104471. [Google Scholar] [CrossRef]
Quan, Y.; Zhang, D.; Zhang, L.; Tang, J. Centralized Feature Pyramid for Object Detection. IEEE Trans. Image Process. 2023, 32, 4341–4354. [Google Scholar] [CrossRef]
Ren, Y.; Zhu, C.; Xiao, S. Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Appl. Sci. 2018, 8, 813. [Google Scholar] [CrossRef]
Pang, J.; Li, C.; Shi, J.; Xu, Z.; Feng, H. R² -CNN: Fast Tiny Object Detection in Large-Scale Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5512–5524. [Google Scholar] [CrossRef]
Hamaguchi, R.; Fujita, A.; Nemoto, K.; Imaizumi, T.; Hikosaka, S. Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1442–1450. [Google Scholar] [CrossRef]
Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Song, T.; Sun, L.; Xie, D.; Sun, H.; Pu, S. Small-scale Pedestrian Detection Based on Topological Line Localization and Temporal Feature Aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Han, B.; Wang, Y.; Yang, Z.; Gao, X. Small-Scale Pedestrian Detection Based on Deep Neural Network. IEEE Trans. Intell. Transp. Syst. 2020, 21, 3046–3055. [Google Scholar] [CrossRef]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Kellenberger, B.; Volpi, M.; Tuia, D. Fast animal detection in UAV images using convolutional neural networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 866–869. [Google Scholar] [CrossRef]
Hausamann, D.; Zirnig, W.; Schreier, G.; Strobl, P. Monitoring of gas pipelines—A civil UAV application. Aircr. Eng. Aerosp. Technol. 2005, 77, 352–360. [Google Scholar] [CrossRef]
Sa, I.; Hrabar, S.; Corke, P. Outdoor Flight Testing of a Pole Inspection UAV Incorporating High-speed Vision. In Field and Service Robotics: Results of the 9th International Conference; Mejias, L., Corke, P., Roberts, J., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 107–121. [Google Scholar] [CrossRef]
Doherty, P.; Rudol, P. A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization. In Proceedings of the AI 2007: Advances in Artificial Intelligence, Osnabrück, Germany, 10–13 September 2007; pp. 1–13. [Google Scholar]
Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910. [Google Scholar] [CrossRef]
Li, X.; Li, X.; Li, Z.; Xiong, X.; Khyam, M.O.; Sun, C. Robust Vehicle Detection in High-Resolution Aerial Images with Imbalanced Data. IEEE Trans. Artif. Intell. 2021, 2, 238–250. [Google Scholar] [CrossRef]
Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network. Remote Sens. 2020, 12, 1432. [Google Scholar] [CrossRef]
Wang, X.; Wang, A.; Yi, J.; Song, Y.; Chehri, A. Small Object Detection Based on Deep Learning for Remote Sensing: A Comprehensive Review. Remote Sens. 2023, 15, 3265. [Google Scholar] [CrossRef]
Courtrai, L.; Pham, M.T.; Lefèvre, S. Small Object Detection in Remote Sensing Images Based on Super-Resolution with Auxiliary Generative Adversarial Networks. Remote Sens. 2020, 12, 3152. [Google Scholar] [CrossRef]
Fu, Y.; Zheng, C.; Yuan, L.; Chen, H.; Nie, J. Small Object Detection in Complex Large Scale Spatial Image by Concatenating SRGAN and Multi-Task WGAN. In Proceedings of the 2021 7th International Conference on Big Data Computing and Communications (BigCom), Deqing, China, 13–15 August 2021; pp. 196–203. [Google Scholar] [CrossRef]
Cao, G.; Xie, X.; Yang, W.; Liao, Q.; Shi, G.; Wu, J. Feature-fused SSD: Fast detection for small objects. In Proceedings of the Ninth International Conference on Graphic and Image Processing (ICGIP 2017), Qingdao, China, 14–16 October 2017; Yu, H., Dong, J., Eds.; International Society for Optics and Photonics. SPIE: Bellingham, WA, USA, 2018; Volume 10615, p. 106151E. [Google Scholar] [CrossRef]
Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Small Object Detection Using Deep Feature Pyramid Networks. In Proceedings of the Advances in Multimedia Information Processing—PCM 2018, Hefei, China, 21–22 September 2018; pp. 554–564. [Google Scholar]
Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659. [Google Scholar] [CrossRef]
Hu, X.; Xu, W.; Gan, Y.; Su, J.; Zhang, J. Towards Disturbance Rejection in Feature Pyramid Network. IEEE Trans. Artif. Intell. 2023, 4, 946–958. [Google Scholar] [CrossRef]
Huang, W.; Li, G.; Chen, Q.; Ju, M.; Qu, J. CF2PN: A Cross-Scale Feature Fusion Pyramid Network Based Remote Sensing Target Detection. Remote Sens. 2021, 13, 847. [Google Scholar] [CrossRef]
Singh, B.; Davis, L.S. An Analysis of Scale Invariance in Object Detection—SNIP. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3578–3587. [Google Scholar] [CrossRef]
Singh, B.; Najibi, M.; Davis, L.S. SNIPER: Efficient Multi-Scale Training. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
Eggert, C.; Brehm, S.; Winschel, A.; Zecha, D.; Lienhart, R. A closer look: Small object detection in faster R-CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 421–426. [Google Scholar] [CrossRef]
Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Zhou, J.; Vong, C.M.; Liu, Q.; Wang, Z. Scale adaptive image cropping for UAV object detection. Neurocomputing 2019, 366, 305–313. [Google Scholar] [CrossRef]
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
Terrail, J.O.D.; Jurie, F. On the use of deep neural networks for the detection of small vehicles in ortho-images. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 4212–4216. [Google Scholar] [CrossRef]
Plastiras, G.; Siddiqui, S.; Kyrkou, C.; Theocharides, T. Efficient Embedded Deep Neural-Network-based Object Detection Via Joint Quantization and Tiling. In Proceedings of the 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genova, Italy, 31 August–2 September 2020; pp. 6–10. [Google Scholar] [CrossRef]
Ozge Unel, F.; Ozkalayci, B.O.; Cigla, C. The Power of Tiling for Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar] [CrossRef]
Gong, M.; Zhao, H.; Wu, Y.; Tang, Z.; Feng, K.Y.; Sheng, K. Dual Appearance-Aware Enhancement for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5602914. [Google Scholar] [CrossRef]
Yao, Y.; Cheng, G.; Lang, C.; Yuan, X.; Xie, X.; Han, J. Hierarchical Mask Prompting and Robust Integrated Regression for Oriented Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 1. [Google Scholar] [CrossRef]
Tang, T.; Zhou, S.; Deng, Z.; Zou, H.; Lei, L. Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors 2017, 17, 336. [Google Scholar] [CrossRef]
Wu, Z.; Suresh, K.; Narayanan, P.; Xu, H.; Kwon, H.; Wang, Z. Delving Into Robust Object Detection From Unmanned Aerial Vehicles: A Deep Nuisance Disentanglement Approach. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1201–1210. [Google Scholar] [CrossRef]
Li, W.; Wei, W.; Zhang, L. GSDet: Object Detection in Aerial Images Based on Scale Reasoning. IEEE Trans. Image Process. 2021, 30, 4599–4609. [Google Scholar] [CrossRef]
Weng, W.; Wei, M.; Ren, J.; Shen, F. Enhancing Aerial Object Detection with Selective Frequency Interaction Network. IEEE Trans. Artif. Intell. 2024, 1–12. [Google Scholar] [CrossRef]
Ammar, A.; Koubaa, A.; Ahmed, M.; Saad, A.; Benjdira, B. Vehicle Detection from Aerial Images Using Deep Learning: A Comparative Study. Electronics 2021, 10, 820. [Google Scholar] [CrossRef]
Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Tao, X.; Michael, K.; Fang, J.; Imyhxy; et al. Ultralytics/yolov5: v6.2—YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai integrations. Zenodo 2022. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
Mittal, P.; Singh, R.; Sharma, A. Deep learning-based object detection in low-altitude UAV datasets: A survey. Image Vis. Comput. 2020, 104, 104046. [Google Scholar] [CrossRef]
Robicquet, A.; Sadeghian, A.; Alahi, A.; Savarese, S. Learning Social Etiquette: Human Trajectory Understanding In Crowded Scenes. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 549–565. [Google Scholar]
Kharuzhy, Y. Aerial-Cars-Dataset. 2018. Available online: https://github.com/jekhor/aerial-cars-dataset (accessed on 1 December 2019).
Wang, J.; Simeonova, S.; Shahbazi, M. Orientation- and Scale-Invariant Multi-Vehicle Detection and Tracking from Unmanned Aerial Videos. Remote Sens. 2019, 11, 2155. [Google Scholar] [CrossRef]
Yu, H.; Li, G.; Zhang, W.; Huang, Q.; Du, D.; Tian, Q.; Sebe, N. The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline. Int. J. Comput. Vis. 2020, 128, 1141–1159. [Google Scholar] [CrossRef]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-Based Models for Speech Recognition. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [Google Scholar] [CrossRef]
Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar] [CrossRef]
Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019. [Google Scholar] [CrossRef]
Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499. [Google Scholar]
Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
Insafutdinov, E. Towards Accurate Multi-Person Pose Estimation in the Wild. 2020. Available online: https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/31184 (accessed on 26 October 2024).
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
Yang, Z.; Yu, H.; Feng, M.; Sun, W.; Lin, X.; Sun, M.; Mao, Z.H.; Mian, A. Small Object Augmentation of Urban Scenes for Real-Time Semantic Segmentation. IEEE Trans. Image Process. 2020, 29, 5175–5190. [Google Scholar] [CrossRef]
Dwibedi, D.; Misra, I.; Hebert, M. Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1310–1319. [Google Scholar] [CrossRef]
Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 1 January 2020).

Figure 1. Example Aerial-Cars image. The additional annotated objects (denoted by red squares) are primarily located towards the horizon and in the upper-left and -right corners.

Figure 2. Histogram of object sizes in the synthetic and aerial-based data explored in this work.

Figure 3. Complete CR-CNN architecture. The portions within the red hashed borders are standard implementations of Mask R-CNN utilizing ResNeXt-101 and PANet. The secondary RoI prediction branch proposed in this work is contained within the blue hashed border with the RPN utilizing the architecture of CR-CNN-off or CR-CNN-abs, as discussed in Materials and Methods.

Figure 4. A cropped image from the Aerial-Cars detection dataset demonstrating how a portion of the anchors (green) is overlaid onto each object’s groundtruth label (red) in order to calculate the IoU. Artificially inflating the IoU of the object–anchor pairs for small objects increases the number of anchors used during training—effectively increasing the object size of the label annotation.

Figure 5. Various IoU inflation functions tested for improved detection of small objects. The Gaussian function resulted in the largest increase in small object performance with minimal decreases in larger object accuracy.

Figure 6. CR-CNN predictions on Aerial-Cars Figure MOS157 [54] with a prediction threshold of 0.25. Predictions are represented by each blue box.

Figure 7. CR-CNN predictions on Aerial-Cars Figure DJI-0762-00001 [54] with a prediction threshold of 0.25. This image represents one of the poorer performances of CR-CNN on the data, with missed predictions in regions with changing light and contrast. Predictions are represented by each blue box.

Figure 8. Average number of matched anchors.

Table 1. Object size and density in COCO and aerial datasets.

	Object Size	Number of Objects	Percent of Total Objects	Percent of Total Object Area	Percent of Total Image Area
COCO	Small	267,700	31.13	0.58	0.33
	Medium	300,129	34.90	5.99	3.39
	Large	292,172	33.97	93.44	52.82
Vehicle-Detection	Small	15,478	80.0	66.51	0.61
	Medium	3870	20.0	33.49	0.28
	Large	0	0.0	0.0	0.0
Aerial-Cars	Small	539	16.98	3.98	0.16
	Medium	2415	76.09	60.88	2.50
	Large	220	06.93	35.14	1.44

Table 2. Detection accuracy of Mask R-CNN and YOLOv5 on Aerial-Cars trained utilizing various inflation curves.

(a) Mask R-CNN
	Gaussian	Step	Square	Sigmoid
AP₅₀	76.4	72.6	73.6	73.2
AP	41.0	39.6	39.8	38.7
AP_S	38.7	38.5	38.0	35.3
AP_M	41.2	39.7	40.5	41.0
AP_L	57.1	52.0	53.2	51.4
(b) YOLOv5
	Gaussian	Step	Square	Sigmoid
AP₅₀	72.0	62.3	58.5	61.1
AP	35.7	30.0	28.9	27.6
AP_S	32.6	28.9	22.8	26.1
AP_M	36.6	26.4	24.1	29.0
AP_L	36.5	32.8	26.8	28.4

Table 3. CR-CNN’s results on synthetic data.

	AP	AP₅₀	AP_S	AP_M
Mask R-CNN	55.9	76.1	53.8	72.0
YOLOv5	60.3	72.5	63.7	73.5
CR-CNN-abs	60.8	77.0	60.4	73.1
CR-CNN-off	57.2	76.6	60.1	73.5

All models are trained using the inflated IoU strategy.

Table 4. CR-CNN’s results on UAV-based data.

(a) Vehicle-Detection
	AP	AP₅₀	AP_S	AP_M
Mask R-CNN	26.9	47.4	31.5	20.8
YOLOv5	18.6	39.0	20.1	8.3
DETR	19.0	50.1	19.1	23.2
CR-CNN-abs	29.8	56.9	34.0	19.4
CR-CNN-off	28.4	54.7	33.4	19.3
(b) Aerial-Cars
	AP	AP₅₀	AP_S	AP_M	AP_L
Mask R-CNN	40.2	73.7	38.2	40.4	51.5
YOLOv5	23.0	55.0	20.9	24.2	6.9
DETR	38.2	74.8	35.8	38.9	33.6
CR-CNN-abs	46.6	78.9	43.8	43.6	52.7
CR-CNN-off	46.0	77.4	41.1	43.8	52.7

All models are trained using the inflated IoU strategy.

Table 5. Composition of CR-CNN predictions on UAV data.

	Vehicle-Detection			Aerial-Cars
	TP	FP	FN	TP	FP	FN
Mask R-CNN	1563	94	2048	850	36	286
CR-CNN-abs	1794	88	1853	881	33	255
CR-CNN-off	1790	86	1857	879	32	257

A prediction is considered to be positive at an IoU greater than 0.7 and false when the prediction’s IoU is less than 0.7.

Table 6. Inflated IoU training applied to standard detection architectures on synthetic data.

	AP	AP₅₀	AP_S	AP_M
Mask R-CNN ^†	55.0	74.4	53.2	71.3
Mask R-CNN	55.9	76.1	53.8	72.0
Mask ^† + C&P	52.6	71.2	49.3	70.0
YOLOv5 ^†	60.1	71.2	58.9	72.9
YOLOv5 ^† + C&P	34.3	39.5	34.2	44.6
YOLOv5	60.3	72.5	63.7	73.5

C&P refers to the copy-and-paste method from [35], copying some of the objects within an image three times. ^† Models trained without the inflation strategy.

Table 7. Inflated IoU training applied to standard detection architectures on COCO 2017.

	AP	AP₅₀	AP_S	AP_M	AP_L
Mask R-CNN ^†	42.0	62.5	25.2	45.6	54.6
Mask R-CNN	42.2	63.1	26.3	45.7	54.5
Mask ^† + C&P	42.0	62.8	25.5	45.5	54.5
YOLOv5 ^†	48.9	67.5	31.8	54.4	62.3
YOLOv5 ^† + C&P	49.2	67.4	31.9	54.2	62.4
YOLOv5	50.0	67.9	31.9	56.1	61.8

C&P refers to the copy-and-paste method from [35], copying some of the objects within an image three times. ^† Models trained without the inflation strategy.

Table 8. Inflated IoU training applied to Mask R-CNN segmentations on COCO 2017.

	AP	AP₅₀	AP_S	AP_M	AP_L
Mask R-CNN ^†	38.0	58.3	19.8	41.2	50.1
Mask R-CNN	39.0	58.8	21.1	41.4	50.1
Mask ^† + C&P	38.1	58.5	19.8	41.3	50.2

C&P refers to the copy-and-paste method from [35], copying some of the objects within an image three times. ^† Models trained without the inflation strategy.

Table 9. Inflated IoU training applied to standard detection architectures on real-world aerial data.

(a) Vehicle-Detection
	AP	AP₅₀	AP_S	AP_M
Mask R-CNN ^†	25.6	46.1	28.6	19.0
Mask ^† + C&P	25.4	46.3	28.7	19.2
Mask R-CNN	26.9	47.4	31.5	20.8
YOLOv5 ^†	17.2	37.8	18.2	7.1
YOLOv5 ^† + C&P	16.9	37.9	18.8	7.9
YOLOv5	18.6	39.0	20.1	8.3
(b) Aerial-Cars
	AP	AP ₅₀	AP_S	AP_M	AP_L
Mask R-CNN ^†	40.2	73.7	38.2	40.4	51.5
Mask ^† + C&P	40.3	74.4	37.0	40.8	48.6
Mask R-CNN	41.0	76.4	38.7	41.2	57.1
YOLOv5 ^†	23.0	55.0	20.9	24.2	6.9
YOLOv5 ^† + C&P	36.0	71.9	31.9	36.8	39.6
YOLOv5	35.7	72.0	32.6	36.6	36.5

C&P refers to the copy-and-paste method from [35], copying some of the objects within an image three times. ^† Models trained without the inflation strategy.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Butler, J.; Leung, H. A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection. Remote Sens. 2024, 16, 4065. https://doi.org/10.3390/rs16214065

AMA Style

Butler J, Leung H. A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection. Remote Sensing. 2024; 16(21):4065. https://doi.org/10.3390/rs16214065

Chicago/Turabian Style

Butler, Justin, and Henry Leung. 2024. "A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection" Remote Sensing 16, no. 21: 4065. https://doi.org/10.3390/rs16214065

APA Style

Butler, J., & Leung, H. (2024). A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection. Remote Sensing, 16(21), 4065. https://doi.org/10.3390/rs16214065

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Heatmap-Supplemented R-CNN Trained Using an Inflated IoU for Small Object Detection

Abstract

1. Introduction

Small Object Detection

2. Materials and Methods

2.1. Datasets

2.2. Architecture

2.2.1. Mask R-CNN

2.2.2. CR-CNN

2.3. Inflated IoU Training

3. Results

3.1. Architecture-Derived Improvements

3.2. Inflation-Derived Improvements

4. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI