1. Introduction
Object detection is one of the most fundamental and challenging problems in computer vision. It serves as a prerequisite for a wide range of downstream applications, such as instance segmentation [1], pose estimation [2], surveillance [3], and autonomous driving [4]. With the development of deep learning, remarkable progress in object detection has been made in recent years. Single-stage frameworks, such as RetinaNet [5] and FCOS [6], and two-stage frameworks, such as Faster R-CNN [7] and Cascade R-CNN [8], have substantially pushed forward the state of the art. Despite the apparent differences among these frameworks, object detection is usually formulated as two main tasks: one is a classification task to distinguish foreground from background and determine which category an object belongs to; the other is a localization task to regress a set of coefficients that localize the object as accurately as possible. Among the duplicated detections matched with the same object, only the one with the highest score is counted as a true positive, while the others are counted as false positives. However, traditional Non-Maximum Suppression (NMS), which removes duplicated detections, uses the classification score as the ranking keyword, leading to a misalignment between the classification score and the localization accuracy. In this case, more accurately localized bounding boxes can be suppressed by less accurate ones with higher classification scores.
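For concreteness, the following is a minimal sketch of the standard (hard) NMS procedure described above; the function names and the thresholds are illustrative. Note that the score used for ranking is the classification score, which is exactly the source of the misalignment:

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: rank by score, suppress overlapping boxes.
    When `scores` are classification scores, a well-localized box can be
    suppressed by a worse-localized box that happens to score higher."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # drop boxes overlapping the current top-scored box above the threshold
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thr]
    return keep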
Recently, a localization quality estimation branch, usually parallel to the main branches of classification and bounding box regression, has been introduced and leads to encouraging advances in object detection. FCOS [6] and IoU-Net [9] predict the centerness score and the Intersection-over-Union (IoU) value, respectively, to estimate the localization quality. There are some differences between the two estimation scores. The centerness score is used in anchor-free detectors to filter out low-quality bounding boxes predicted by locations far from the center of an object. The IoU value is used in anchor-based detectors to resolve the misalignment between classification confidence and localization accuracy. At inference, the IoU is predicted on the detected bounding boxes, while the centerness is computed at each location irrespective of the detected boxes. Thus, the IoU value between the detected bounding box and the matched ground-truth is more correlated with the localization accuracy. In this paper, we focus on the estimation of the IoU value and mainly study the sampling strategy for an estimation branch of IoU regression.
Currently, there exist two sampling strategies to train the IoU regression branch. First, ref. [10] uses the same training samples as the main branches of classification and bounding box regression; all three branches take samples from pre-defined anchor boxes. Second, ref. [9] adopts uniform sampling w.r.t. the IoU by manually augmenting the ground-truth to generate samples; the IoU branch and the two main branches are trained independently. For the existing strategies, we observe the following problems.
In the first case, the distributions of training and inference are consistent for the main branches but inconsistent for the IoU branch. In training, for each location on the feature map, the pre-defined anchor boxes are selected as training samples for all three branches. At inference, the three branches play different parts in the object detection pipeline. Specifically, given a set of anchors, the bounding box regression branch transforms the anchors to best fit the ground-truth; the classification branch distinguishes the categories of anchors through the probabilities for each class label; the IoU estimation branch predicts the quality of the detected bounding boxes. For classification and bounding box regression, the training and inference distributions are consistent, as both are performed on the anchors. Such consistency of distributions is a common paradigm in deep learning and is empirically a simple yet effective training strategy: it enables the network to efficiently learn good representations of the specific distribution seen at inference. However, for the IoU regression branch there is a discrepancy between training and inference distributions, since it is performed on the detected bounding boxes at inference but trained on the anchors. This inconsistency inevitably induces ineffective learning.
In the second case, ref. [9] first generates candidate samples by manually augmenting the ground-truth, and then uniformly samples training examples from this candidate set w.r.t. all IoU levels (>0.5). The manual augmentation assembles enough training examples for the uniform sampling. In contrast to this uniform sampling from manual augmentation, the main tasks of classification and bounding box regression adopt random sampling from the Region Proposal Network (RPN) proposals. There is a dilemma in that the training samples are generated by manually augmenting the ground-truth rather than by using the RPN proposals. This brings three drawbacks: (1) The breaking of unified training samples. The RPN proposals are used for both classification and bounding box regression, while extra samples are manually generated for training the IoU regression. (2) The ineffective learning of outliers. When samples of all levels are selected for the IoU regression, the outlier samples, which are not consistent with the inference distribution, lead to ineffective learning. (3) The increase of learning difficulty. Ideally, the IoU regressor based on uniform sampling is expected to be optimal at all IoU levels; however, such an ideal regressor inevitably increases the difficulty of learning.
One question is whether uniform sampling is necessary to train an IoU regressor. We analyze this question from two perspectives: (1) The intent of the IoU regression task. Different from bounding box regression, which aims to bring a box infinitely close to the target box, the objective of the IoU value is to decide which of two overlapping bounding boxes has the more accurate localization in the NMS procedure. In other words, rather than predicting the exact IoU value, the IoU regressor is intended to discriminate between two overlapping boxes. (2) The effectiveness of a single IoU regressor at all IoU levels. Cascade R-CNN [8] suggests that each bounding box regressor performs best for the IoU level at which it was trained. A cascaded regression is proposed, such that regressors deeper into the cascade are sequentially optimized for higher IoU levels; thus, the difficulty of bounding box regression at different IoU levels is decomposed into a sequence of stages. For a single regressor, the light head is usually framed as two fully connected layers, so its learning capacity is limited. Compared with cascaded bounding box regressors, learning good representations for all IoU levels is more difficult for a single, light IoU regressor. In conclusion, the easier task of IoU regression is paired with the more complex strategy of uniform sampling, and it is uncertain whether uniform sampling leads to effective learning at all IoU levels for a single IoU regressor.
In this work, we aim to improve the performance of the two-stage Faster R-CNN with an additional IoU regression task and to solve the aforementioned sampling problems for the IoU regression branch. To distinguish it from the two main tasks, the IoU regression task is also called the auxiliary task. Based on the fact that the IoU regressor operates on the detected bounding boxes (the regressed RPN proposals) at inference, sampling from the regressed RPN proposals can guarantee consistent distributions between training and inference. Thus, we can solve the inconsistency in the first sampling case. It is worth noting that the regressed RPN proposals are heavily tilted toward high IoU levels and the RPN proposals toward low IoU levels. When sampling from the regressed RPN proposals, the IoU regressor can learn good representations of the high IoU distribution consistent with inference. Ideally, the IoU regressor should be optimal at all IoU levels. However, uniform sampling by manually augmenting the ground-truth, which inevitably increases the difficulty of learning, is not certainly effective at all IoU levels. A compromise with uniform sampling is to select both low IoU samples and high IoU samples to train the IoU regression branch, resulting in an IoU regressor that is optimal at both high and low IoU levels within a single structure. In this manner, we can reduce the learning difficulty in the second case.
In this paper, we introduce an auxiliary IoU regression branch based on Faster R-CNN, called IoU-Aware R-CNN. We propose an H+L-Sampling strategy to select high and low IoU samples simultaneously, in which the low IoU samples are selected from the RPN proposals and the high IoU samples are obtained by transforming the low IoU samples. The high IoU samples satisfy the consistent sampling. On the basis of the consistent samples, adding the existing low IoU samples brings negligible computational burden and still substantially improves performance and robustness. This strategy inherits the effectiveness of consistent sampling and reduces the difficulty of uniform sampling, resulting in an IoU regressor that is optimal at both low and high IoU levels. This simple but powerful branch brings significant improvement in detection performance. Our IoU regression remains powerful when trained on few samples, which requires few computational resources and is more compatible with real-world applications. The probabilities of categories reflect classification confidence, and the predicted IoU values between the detected bounding boxes and the ground-truth reflect localization confidence. Finally, we combine the predicted IoU with the classification probability as the final detection confidence for the ranking process of NMS, removing duplicated bounding boxes and preserving accurately localized ones.
In summary, the main contributions of this paper are listed as follows:
We propose an H+L-Sampling strategy, which satisfies the consistency of distributions and keeps the difficulty of learning low, to train an additional IoU regression branch in our IoU-Aware R-CNN. For the auxiliary task, our sampling is simple and effective for estimating the localization accuracy.
Throughout our IoU-Aware detector, we have a unified structure in both training and inference. Rather than manually augmenting the ground-truth, all three branches take samples from the RPN proposals in training. In the post-processing of NMS, a detection confidence is proposed, which simultaneously encodes the probability of a class appearing in the box and the accuracy with which the predicted box localizes the object.
Extensive experiments show the effectiveness of our sampling strategy in solving the problem of absent localization accuracy, as well as its simplicity and competitiveness even compared with several state-of-the-art object detectors. Due to its effectiveness and simplicity, our IoU regression branch is compatible with most two-stage detectors.
3. Proposed Method
In this section, we propose the IoU-Aware R-CNN, which adds an auxiliary branch of IoU regression to Faster R-CNN, as shown in Figure 1. The auxiliary branch of IoU regression is parallel to the main branches of classification and bounding box regression. The three branches play different parts in the two-stage detection pipeline. Specifically, given a set of RPN proposals $B$, the bounding box regressor transforms the RPN proposals to best fit the ground-truth; the classifier distinguishes the categories of the RPN proposals through the softmax probabilities for each class label; different from the main branches, which are performed on the RPN proposals, the IoU regressor predicts the quality of the detected bounding boxes $\hat{B}$. Thus, for each detected bounding box, there are two confidences that reflect the performance of detection: the classification score ($S_{cls}$) indicates the probability of the category the bounding box belongs to; the localization IoU ($I_{loc}$) with the corresponding ground-truth indicates the localization accuracy of the bounding box. The product of the two confidences is used as the final detection confidence ($S_{det}$) for the ranking process of NMS during inference. In the following, we show the details of our method.
3.1. Separate Sampling for the Main and Auxiliary Branches
In our IoU-Aware R-CNN, the sampling for the auxiliary branch of IoU regression is separate from the sampling for the two main branches, as shown in Figure 2. There are two reasons for this design: (1) The most widely adopted sampling method for the two main branches is random sampling with a fixed positive-to-negative ratio, as in Faster R-CNN [24]. Besides, more efficient sampling strategies have been proposed to improve performance. One popular idea is hard mining to select hard samples, such as OHEM [19], Libra R-CNN [20], and RetinaNet [5]. Recently, PISA [21] focuses on the prime samples, which have a greater influence on the performance of object detection. Using separate sampling for IoU regression makes the auxiliary branch more compatible with detectors that adopt the mentioned sampling strategies for the main branches. (2) Separate sampling makes it convenient to conduct ablation experiments and to design a simple sampling scheme focusing only on the positives for the IoU regression task.
For the main branches of classification and bounding box regression, we adopt random sampling following Faster R-CNN. The RPN proposals highly overlap with each other, so before sampling, NMS with a fixed IoU threshold of 0.7 is adopted to reduce redundancy, yielding the set of RPN proposals $B$ in Figure 2. Note that the IoU regression branch also selects samples from the same candidate set $B$. For each $b \in B$, if its IoU with the ground-truth is greater than 0.5, we assign a positive label to $b$; otherwise, we assign a negative label. Finally, we randomly select positive and negative samples $X_{main}$ with a fixed positive-to-negative ratio.
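A minimal sketch of this labeling-and-sampling step follows; the set names B and X_main follow the notation above, while the batch size and positive fraction are the common Faster R-CNN defaults and are illustrative here:

import numpy as np

def sample_main_branch(ious, num_samples=512, pos_fraction=0.25, pos_thr=0.5):
    """Randomly sample proposal indices for the classification/regression head.
    `ious` holds, for each RPN proposal in B, its max IoU with any ground-truth.
    Returns indices of X_main with a fixed positive-to-negative ratio."""
    pos_inds = np.where(ious > pos_thr)[0]   # positive label: IoU > 0.5
    neg_inds = np.where(ious <= pos_thr)[0]  # negative label otherwise

    num_pos = min(len(pos_inds), int(num_samples * pos_fraction))
    num_neg = min(len(neg_inds), num_samples - num_pos)

    pos = np.random.choice(pos_inds, num_pos, replace=False)
    neg = np.random.choice(neg_inds, num_neg, replace=False)
    return np.concatenate([pos, neg])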
For the auxiliary branch of the IoU regressor, it operates on the detected bounding boxes to predict their IoU values with the corresponding ground-truth at inference. Corresponding to Figure 1, given the set of RPN proposals $B$, the detected bounding boxes $\hat{B}$ can be computed by:

$\hat{B} = f_{reg}(B, c), \quad (1)$

where $f_{reg}$ is the bounding box regressor taking $c$ as parameters. The core idea of a bounding box regressor is that a network directly learns to transform a bounding box to its designated target. Inspired by the observation in Cascade R-CNN [8] that the output IoU of a bounding box regressor is almost invariably higher than the input IoU, the regressed $\hat{B}$ has a higher IoU level than $B$. We show some qualitative results of the head-to-head comparison in Figure 3.
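As a sketch of Equation (1), the following shows the standard delta-based box transform used in Faster R-CNN-style regressors; the function name and the absence of target denormalization are simplifications relative to mmdetection's implementation:

import numpy as np

def apply_box_regression(proposals, deltas):
    """Transform RPN proposals B into detected boxes B_hat, i.e., Equation (1).
    proposals: (N, 4) boxes as (x1, y1, x2, y2); deltas: (N, 4) as (dx, dy, dw, dh)."""
    w = proposals[:, 2] - proposals[:, 0]
    h = proposals[:, 3] - proposals[:, 1]
    cx = proposals[:, 0] + 0.5 * w
    cy = proposals[:, 1] + 0.5 * h

    # shift the center and rescale the size as predicted by the regressor
    new_cx = cx + deltas[:, 0] * w
    new_cy = cy + deltas[:, 1] * h
    new_w = w * np.exp(deltas[:, 2])
    new_h = h * np.exp(deltas[:, 3])

    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)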
For the IoU regression task, we propose a simple and effective strategy, H+L-Sampling (Figure 2), to select high and low IoU samples simultaneously and train a single IoU regressor. First, we adopt the aforementioned random sampling to select only positive samples from $B$, obtaining the samples with low IoU, $X_L$. Then, Equation (1) is applied to $X_L$ to get the high IoU samples $X_H$, which form a small part of $\hat{B}$, satisfying the distribution consistency between training and inference for IoU regression. Finally, the two sets of samples are used to train a single IoU regressor. Note that multi-stage IoU regressors, like the cascaded structure in [9], are unavailable for the task of IoU evaluation: even if cascaded IoU regressors were separately trained on $X_L$ and $X_H$, the IoU evaluation is performed on the final detections $\hat{B}$, so the same cascade procedure is inapplicable at inference.
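A minimal sketch of H+L-Sampling as described above, reusing the hypothetical apply_box_regression helper from the previous sketch; the per-set sample count of 64 follows the default discussed later in this section:

import numpy as np

def hl_sampling(proposals, ious, reg_deltas, num_per_set=64):
    """Select low and high IoU training samples for the single IoU regressor.
    proposals: RPN proposals B after NMS; ious: their max IoU with ground-truth;
    reg_deltas: outputs of the bounding box regressor for these proposals."""
    # L: randomly take positive RPN proposals -> low IoU set X_L
    pos_inds = np.where(ious > 0.5)[0]
    pick = np.random.choice(pos_inds, min(num_per_set, len(pos_inds)), replace=False)
    x_low = proposals[pick]

    # H: transform X_L with Equation (1) -> high IoU set X_H,
    # consistent with the detected boxes seen at inference
    x_high = apply_box_regression(x_low, reg_deltas[pick])

    return x_low, x_high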
We analyze the effectiveness of our proposed H+L-Sampling strategy as follows: (1) $X_L$ is the most convenient and straightforward set of samples, focusing on the low IoU distribution. While there is a small discrepancy between training and inference distributions, $X_L$ is feasible for training the IoU regressor in a simple manner. (2) $X_H$ focuses on the high IoU distribution, which is consistent with inference. This sampling guarantees the effective learning of a specific distribution consistent with inference. It is worth noting that consistency means that the input distributions of both training and inference tend toward a specific IoU level, though lower or higher IoU examples may still exist. Under the same number of samples, the consistent samples $X_H$ are the better choice. (3) $X_L + X_H$ can be regarded as adding the low IoU distribution on top of the consistent distribution, increasing the diversity of samples. Ideally, the complete training samples should cover all IoU levels that may appear at inference. However, the more training samples there are, the more difficult the learning becomes. It is a trade-off between the diversity of samples and the difficulty of learning. Our H+L-Sampling, which selects the more effective samples and takes full advantage of the existing samples $X_L$, is simple and effective.
Compared with the manual uniform sampling in [9,22], our proposed sampling differs in the following aspects: (1) In uniform sampling, the ground-truth is manually augmented to generate enough samples for each IoU interval. In our method, we take samples from the RPN proposals, the same as the two main branches, resulting in a more unified sampling for the detector. (2) Among the manual samples, the number of outliers that are not consistent with the inference distribution is larger than among the RPN proposals. While the RPN proposals are not completely consistent, the distribution difference between training and inference is smaller than with the manual samples. (3) The uniform sampling in [22] divides the IoU into 4 intervals, and each ground-truth keeps 64 samples for each IoU interval. Given an image with K annotated ground-truths, the overall number of training samples is $4 \times 64 \times K = 256K$. In the default setting of our sampling, however, only 64 RPN proposals and 64 regressed RPN proposals per image are used, far fewer samples to train the IoU branch, and the corresponding difficulty of learning is much reduced. (4) Our H+L-Sampling can be seen as covering two intervals: a low IoU level and a high IoU level. For each interval, the only difference between the uniform sampling and the proposed sampling is the number of samples (64 per image vs. 64 per ground-truth). In our sampling, it is interesting to observe that changing the sample number (64) to 32 or 128 makes a difference of only 0.1, which suggests that the IoU evaluation task, aimed at distinguishing the localization accuracy of two overlapping bounding boxes, is insensitive to the number of samples. This insensitivity indicates that the dense samples ($64K$ per image for each interval) in uniform sampling are unnecessary for IoU regression.
3.2. Loss Function
For the two main branches, the classifier $h$ assigns the candidate bounding box $x$ to one of the categories (including the background) and the regressor $f$ regresses the parameterized coordinates of the target bounding box associated with the candidate. Given the training set $X_{main}$, the loss function for the main branches follows Fast R-CNN [24]:

$L_{main} = \frac{1}{N} \sum_i L_{cls}(h(x_i), y_i) + \lambda \frac{1}{N} \sum_i [y_i \geq 1]\, L_{reg}(f(x_i), t_i), \quad (2)$

Here, $i$ is the index of a training sample in a mini-batch, $y_i$ is the class label of $x_i$, and $t_i$ is the 4 parameterized coordinates of the ground-truth. The classification loss $L_{cls}$ is the cross-entropy loss $-\sum_k y_k \log h_k(x)$, where $k$ is the index of categories. The regression loss $L_{reg}$ uses the $L_1$ loss following the default setting in mmdetection [25].
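A compact PyTorch sketch of Equation (2) under these conventions; the weight lam and the foreground-mask handling are simplifications:

import torch
import torch.nn.functional as F

def main_branch_loss(cls_logits, labels, box_preds, box_targets, lam=1.0):
    """Loss of Equation (2): cross-entropy for classification plus L1 box
    regression on foreground samples only (label 0 is the background)."""
    loss_cls = F.cross_entropy(cls_logits, labels)
    fg = labels > 0                                # the [y >= 1] indicator
    if fg.any():
        loss_reg = F.l1_loss(box_preds[fg], box_targets[fg])
    else:
        loss_reg = box_preds.sum() * 0.0           # keep the graph when no fg
    return loss_cls + lam * loss_reg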
For the auxiliary branch, the regressor $f_{iou}$ regresses the target IoU value between the candidate bounding box $x$ and the corresponding ground-truth. Given the low IoU training set $X_L$ and the high IoU set $X_H$, the loss function is defined as:

$L_{IoU} = \frac{1}{|X_L|} \sum_{x_j \in X_L} L_{reg}(f_{iou}(x_j), u_j) + \beta\, \frac{1}{|X_H|} \sum_{\hat{x}_j \in X_H} L_{reg}(f_{iou}(\hat{x}_j), \hat{u}_j), \quad (3)$

Here, $j$ is the index of a training sample in a mini-batch, and $u_j$ and $\hat{u}_j$ are the target IoUs of $x_j$ and $\hat{x}_j$, respectively. Note that the two training sets are used to optimize a single regressor. The losses of low and high IoU samples are weighted by a balancing parameter $\beta$, which by default is set to focus more on the consistent samples, which are more effective.
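A sketch of Equation (3) in PyTorch; the L1 regression loss mirrors the main branches' default, and the concrete value of beta here is illustrative (the paper's default emphasizes the high IoU set):

import torch
import torch.nn.functional as F

def iou_regression_loss(pred_low, target_low, pred_high, target_high, beta=2.0):
    """Loss of Equation (3): one IoU regressor trained on both sample sets.
    pred_*: predicted IoU values for X_L / X_H; target_*: their true IoUs
    with the matched ground-truth; beta weights the consistent high IoU set.
    beta=2.0 is a placeholder, not the paper's reported default."""
    loss_low = F.l1_loss(pred_low, target_low)     # low IoU samples X_L
    loss_high = F.l1_loss(pred_high, target_high)  # high IoU samples X_H
    return loss_low + beta * loss_high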
Overall, we use a multi-task loss to jointly train the main and auxiliary branches:

$L = L_{main} + L_{IoU}, \quad (4)$

For bounding box regression, the numerical values of the regression targets $t$ can be very small, so they are usually normalized by their mean and variance to improve the effectiveness of learning. For IoU regression, we note that this normalization is not required and can be omitted. The ratio of $L_{main}$ to $L_{IoU}$ is roughly 4:1. Due to the small weight of $L_{IoU}$, the auxiliary branch barely affects the original outputs of the detector, which makes it more compatible with practical applications.
3.3. Detection Confidence for NMS
At inference, the overall pipeline of our two-stage IoU-Aware R-CNN detector is shown in Figure 4. In the first stage (RPN), each anchor generates an RPN proposal with a foreground score ($S_{fg}$), and NMS based on $S_{fg}$ is performed on all RPN proposals to choose the top-N proposals for detection. In the second stage (R-CNN), the top-N RPN proposals are refined by the four parameterized coordinates, generating N detected boxes with K classification scores ($S_{cls}$) for each box. Then the IoU estimation is performed on the N detected boxes to predict their localization IoU ($I_{loc}$) with the corresponding ground-truth. If the IoU regressor is class-agnostic, the K classes share the same $I_{loc}$; if class-aware, each of the K classes has its own $I_{loc}$. For the N detected boxes with K classes, the detector finally outputs $N \times K$ (e.g., $1000 \times 80 = 80{,}000$) detections with the classification confidence $S_{cls}$ and the localization confidence $I_{loc}$. Note that if the bounding box regressor is class-aware, each of the K classes gets its own coordinates, and the corresponding parameterized coordinates for each class are used to refine the RPN proposal, generating $N \times K$ detected boxes whose differences between classes are slight. Performing the IoU estimation on all $N \times K$ detected boxes would bring significant computational cost, so the IoU estimation is performed on the N detected boxes with the maximum class score, which has almost no influence on the detection performance.
The post-processing NMS aims to remove duplicated detections and choose the top-100 final detections, which are used to evaluate the detection performance. Before NMS, a threshold score_thr is usually used to remove detections with scores lower than it. Note that there still exist many detections with relatively low $I_{loc}$, which cannot be distinguished by $S_{cls}$ alone. So we define the detection confidence as:

$S_{det} = S_{cls} \cdot I_{loc}, \quad (5)$

which encodes both the probability of the class appearing in the predicted bounding box and how well the bounding box fits the object. The detection confidence $S_{det}$ is used as the metric for ranking detections, so the suppression of duplicated detections is aware of both the localization accuracy and the classification probability. Finally, NMS based on the ranking keyword $S_{det}$ is used to choose the top-100 detections, preserving detections with more accurate localization. Soft-NMS differs only in replacing box elimination with a decrement of the confidence, so $S_{det}$ is also suitable for soft-NMS, and our experiments show consistent improvements.
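As a usage sketch, the following combines the two confidences per Equation (5) and reuses the earlier hypothetical nms helper; the score_thr value is illustrative:

import numpy as np

def rank_and_suppress(boxes, cls_scores, loc_ious, score_thr=0.05, iou_thr=0.5):
    """Rank detections by S_det = S_cls * I_loc and run NMS on that keyword."""
    det_conf = cls_scores * loc_ious          # Equation (5)
    keep_mask = det_conf >= score_thr         # drop low-confidence detections
    boxes, det_conf = boxes[keep_mask], det_conf[keep_mask]
    keep = nms(boxes, det_conf, iou_thr)      # suppression is localization-aware
    return boxes[keep], det_conf[keep]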