Figure 1.
Illustration of the complexity of real driving environments. In real driving environments, objects often appear at widely varying scales and distances, as shown in (a–c).
Figure 2.
Object detection system with the proposed task-specific bounding box regressors (TSBBRs). Given a color image as input, general features are extracted by the shared backbone. With these features as input, the TSBBRs fetch and decode their own related information and perform coordinate and class prediction simultaneously. Finally, all predicted boxes are merged and passed through the non-maximum suppression (NMS) procedure to eliminate redundant boxes.
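To make the merge-and-NMS step in this pipeline concrete, below is a minimal sketch of greedy IoU-based NMS applied to the concatenated outputs of the two regressors. The (x1, y1, x2, y2) box format and the 0.45 IoU threshold are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_and_nms(boxes_per_head, scores_per_head, iou_thresh=0.45):
    """Concatenate the predictions of all TSBBR heads, then apply greedy NMS."""
    boxes = np.concatenate(boxes_per_head, axis=0)
    scores = np.concatenate(scores_per_head, axis=0)
    order = scores.argsort()[::-1]                 # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps < iou_thresh]   # drop redundant boxes
    return boxes[keep], scores[keep]
```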
Figure 3.
Visual analysis of different settings of bounding box regressors (BBRs). (a) Tiny objects (A and B) may be located in the same grid box; the resulting ambiguous back-propagation lowers the recall rate. (b) Large objects (A, B, C, and D) that cross the boundaries of many grid boxes cause numerical instability during the training stage.
Figure 4.
Proposed conditional back-propagation mechanism. According to the ratio of the object's short side to the corresponding frame side of the ground truth, each TSBBR produces losses only for its pre-assigned targets. These losses are collected and summed, and the model weights are updated simultaneously based on the summed loss.
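As a rough PyTorch sketch of the conditional assignment described in this caption: each ground-truth box is routed to one TSBBR by its short-side ratio, and each head's loss is masked so that it only back-propagates for its own targets. The single ratio threshold, the box format, and the per-box loss masking are assumptions for illustration; the paper's exact overlapping criteria are compared later in Table 5.

```python
import torch

def split_targets(gt_boxes, frame_w, frame_h, ratio_thresh=0.1):
    """Assign each ground-truth box to one TSBBR by the ratio of its short side
    to the corresponding frame side. Boxes are (x1, y1, x2, y2); the threshold
    value is an illustrative assumption."""
    w = gt_boxes[:, 2] - gt_boxes[:, 0]
    h = gt_boxes[:, 3] - gt_boxes[:, 1]
    ratio = torch.where(w < h, w / frame_w, h / frame_h)
    tiny_mask = ratio < ratio_thresh        # handled by the tiny-object TSBBR
    return tiny_mask, ~tiny_mask            # second mask: large-object TSBBR

def conditional_loss(loss_tiny, loss_large, tiny_mask, large_mask):
    """Each TSBBR contributes per-box losses only for its own targets; the masked
    losses are summed so a single backward pass updates all weights together."""
    return (loss_tiny * tiny_mask.float()).sum() + (loss_large * large_mask.float()).sum()
```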
Figure 5.
The architecture of the proposed TSBBR network, consisting of 18 convolution layers.
Figure 6.
Examples of augmented data. (a) Original images, (b) dropout, (c) color transformation, (d) rotation, (e) cropping.
Figure 7.
Detection examples on the Pascal VOC 2007 car database. Our model can detect cars that are occluded, located near the image border, or partially hidden among other vehicles of similar appearance.
Figure 8.
Detection results on the CarSim simulation software. (a) The result of the proposed model, which detects a vehicle at 124 m (intersection over union (IoU) > 0.5) and at 200 m (IoU > 0.3). (b) The result of YOLOv2, which detects the vehicle only within 45 m (IoU > 0.5).
Table 1.
Configuration of backbone layers.
Type | Filters | Size/Stride | Output |
---|---|---|---|
Convolutional | 32 | 3 × 3 | 224 × 224 |
Max-pooling | | 2 × 2/2 | 112 × 112 |
Convolutional | 64 | 3 × 3 | 112 × 112 |
Max-pooling | | 2 × 2/2 | 56 × 56 |
Convolutional | 128 | 3 × 3 | 56 × 56 |
Convolutional | 64 | 1 × 1 | 56 × 56 |
Convolutional | 128 | 3 × 3 | 56 × 56 |
Max-pooling | | 2 × 2/2 | 28 × 28 |
Convolutional | 256 | 3 × 3 | 28 × 28 |
Convolutional | 128 | 1 × 1 | 28 × 28 |
Convolutional | 256 | 3 × 3 | 28 × 28 |
Max-pooling | | 2 × 2/2 | 14 × 14 |
Convolutional | 512 | 3 × 3 | 14 × 14 |
Convolutional | 256 | 1 × 1 | 14 × 14 |
Convolutional | 512 | 3 × 3 | 14 × 14 |
Convolutional | 256 | 1 × 1 | 14 × 14 |
Convolutional | 512 | 3 × 3 | 14 × 14 |
Max-pooling | | 2 × 2/2 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 512 | 1 × 1 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 512 | 1 × 1 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
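Since the backbone in Table 1 is a plain stack of convolutions and max-pooling layers, it can be assembled directly from the table. The PyTorch sketch below does so under the assumption that every convolution is followed by batch normalization and leaky ReLU, as is typical for Darknet-style backbones; those details are not listed in the table.

```python
import torch.nn as nn

# (filters, kernel size) for convolutions, 'M' for 2x2/2 max-pooling, following Table 1
BACKBONE_CFG = [
    (32, 3), 'M', (64, 3), 'M',
    (128, 3), (64, 1), (128, 3), 'M',
    (256, 3), (128, 1), (256, 3), 'M',
    (512, 3), (256, 1), (512, 3), (256, 1), (512, 3), 'M',
    (1024, 3), (512, 1), (1024, 3), (512, 1), (1024, 3),
]

def build_backbone(in_channels=3):
    layers = []
    for item in BACKBONE_CFG:
        if item == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            filters, k = item
            layers += [
                nn.Conv2d(in_channels, filters, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(filters),          # assumed; not listed in Table 1
                nn.LeakyReLU(0.1, inplace=True),  # assumed; not listed in Table 1
            ]
            in_channels = filters
    return nn.Sequential(*layers)
```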
Table 2.
Configuration of TSBBR for tiny objects.
Type | Filters | Size/Stride | Output |
---|---|---|---|
Convolutional | 1024 | 3 × 3 | 14 × 14 |
Convolutional | 512 | S × S | 14 × 14 |
Convolutional | 1024 | 3 × 3 | 14 × 14 |
Convolutional | 512 | S × S | 14 × 14 |
Convolutional | 1024 | 3 × 3 | 14 × 14 |
Convolutional | 18 | 3 × 3 | 14 × 14 |
Table 3.
Configuration of TSBBR for large objects.
Type | Filters | Size/Stride | Output |
---|---|---|---|
Max-pooling | | 2 × 2/2 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 512 | S × S | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 512 | S × S | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 28 | 3 × 3 | 7 × 7 |
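Tables 2 and 3 describe the two TSBBR heads as short convolutional stacks on the 14 × 14 and 7 × 7 feature maps, ending in prediction layers with 18 and 28 output channels, respectively. A possible PyTorch rendering is sketched below; the 512 input channels, the reading of the S × S kernels as 1 × 1 reductions, and the batch-norm/leaky-ReLU choices are assumptions, since they are not fully specified in the tables.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    """Convolution plus assumed BatchNorm/LeakyReLU (not listed in the tables)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

def build_tsbbr_head(in_ch, out_ch, pool_first=False):
    """One TSBBR head: alternating 3x3 expansion and 1x1 reduction layers,
    ending in a raw prediction convolution (18 channels for the tiny-object
    head in Table 2, 28 for the large-object head in Table 3)."""
    layers = [nn.MaxPool2d(2, 2)] if pool_first else []   # Table 3 starts with max-pooling
    layers += [
        conv_block(in_ch, 1024, 3), conv_block(1024, 512, 1),
        conv_block(512, 1024, 3), conv_block(1024, 512, 1),
        conv_block(512, 1024, 3),
        nn.Conv2d(1024, out_ch, 3, padding=1),
    ]
    return nn.Sequential(*layers)

# tiny-object head on the 14 x 14 map, large-object head on the 7 x 7 map
tiny_head = build_tsbbr_head(in_ch=512, out_ch=18)
large_head = build_tsbbr_head(in_ch=512, out_ch=28, pool_first=True)
```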
Table 4.
The mean average precision (mAP) obtained with different backbone networks.
Model | mAP (%) | Vehicle AP (%) | Bike AP (%) | Pedestrian AP (%) | FPS |
---|---|---|---|---|---|
AlexNet + BBR | 28.0 | 37.7 | 36.7 | 9.5 | 330.9 |
GoogLeNet + BBR | 48.6 | 69.5 | 65.8 | 10.4 | 94.3 |
Darknet-19 + BBR | 51.9 | 61.7 | 36.1 | 57.7 | 92.3 |
ResNet-152 + BBR | 52.1 | 62.1 | 39.6 | 54.7 | 18.6 |
Table 5.
The mAP of different overlapping criteria of the conditional back-propagation mechanism, where nov denotes the non-overlapping case, ov the overlapping case, and ext the extreme case.
Model | mAP (%) | Vehicle AP (%) | Bike AP (%) | Pedestrian AP (%) | FPS |
---|---|---|---|---|---|
TSBBRs-nov | 72.8 | 81.0 | 75.5 | 61.8 | 66.7 |
TSBBRs-ov | 75.6 | 81.9 | 75.9 | 68.9 |
TSBBRs-ext | N/A | N/A | N/A | N/A |
Table 6.
Results of different settings of shielding computation.
Model | mAP (%) | Vehicle AP (%) | Bike AP (%) | Pedestrian AP (%) | FPS |
---|---|---|---|---|---|
K = 0 | 75.6 | 81.9 | 75.9 | 68.9 | 66.7 |
K = 1 | 77.1 | 84.2 | 77.0 | 70.0 | 71.9 |
K = 3 | 80.8 | 86.0 | 74.4 | 81.9 | 62.5 |
Table 7.
Configuration of the pass-through design for tiny objects.
Type | Filters | Size/Stride | Output |
---|---|---|---|
Group Convolutional | 512 | 1 × 1 | 14 × 14 |
Convolutional | 1024 | 3 × 3 | 14 × 14 |
Convolutional | 512 | 1 × 1 | 14 × 14 |
Convolutional | 1024 | 3 × 3 | 14 × 14 |
Convolutional | 512 | 1 × 1 | 14 × 14 |
Convolutional | 1024 | 3 × 3 | 14 × 14 |
Convolutional | 18 | 3 × 3 | 14 × 14 |
Table 8.
Configuration of the pass-through design for large objects.
Type | Filters | Size/Stride | Output |
---|---|---|---|
Group Convolutional | 512 | 1 × 1 | 7 × 7 |
Max-pooling | 512 | 2 × 2/2 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 512 | 1 × 1 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 512 | 1 × 1 | 7 × 7 |
Convolutional | 1024 | 3 × 3 | 7 × 7 |
Convolutional | 28 | 3 × 3 | 7 × 7 |
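The pass-through designs in Tables 7 and 8 begin with a grouped 1 × 1 convolution, whose group count is varied in Table 9. A grouped 1 × 1 convolution with C input and C output channels uses C × C / groups weights, so groups = 8 reduces the parameters and MACs of that layer eightfold compared with a dense 1 × 1 convolution. A minimal PyTorch sketch, assuming 512 channels and a 14 × 14 feature map:

```python
import torch
import torch.nn as nn

def pass_through_conv(in_ch=512, out_ch=512, groups=8):
    """Grouped 1x1 convolution as in the first row of Tables 7 and 8.
    The group count must evenly divide both channel counts."""
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, groups=groups, bias=False)

x = torch.randn(1, 512, 14, 14)      # e.g., the 14 x 14 feature map
y = pass_through_conv(groups=8)(x)   # same spatial size, 512 channels
print(y.shape)                       # torch.Size([1, 512, 14, 14])
```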
Table 9.
The mAP for different numbers of pass-through groups.
Number of Groups | mAP (%) | Vehicle AP (%) | Bike AP (%) | Pedestrian AP (%) | FPS |
---|---|---|---|---|---|
1 | 77.2 | 82.3 | 76.1 | 73.2 | 64.1 |
4 | 78.4 | 86.5 | 76.5 | 72.2 | 67.1 |
8 | 82.4 | 86.8 | 78.5 | 82.0 | 67.3 |
16 | 73.8 | 83.8 | 72.4 | 59.2 | 66.0 |
Table 10.
Comparison of the proposed method with the state-of-the-art methods.
Models | Input | Dataset(s) | FPS | mAP (%) |
---|---|---|---|---|
YOLOv2 416 [18] | 416 × 416 | ImageNet, COCO | 67 | 76.8 |
YOLOv2 608 [18] | 608 × 608 | ImageNet, COCO | 40 | 78.6 |
YOLOv3 [36] | 320 × 320 | Open Images Dataset, COCO | X | 28.2 |
SSD 300 [17] | 300 × 300 | PASCAL VOC, COCO, ILSVRC | 59 | 74.3 |
SSD 512 [17] | 512 × 512 | PASCAL VOC, COCO, ILSVRC | 59 | 76.9 |
Proposed TSBBRs K = 1, pass through (group = 8) | 448 × 448 | PASCAL VOC, iVS Database | 67.03 | 77.1 |
Proposed TSBBRs K = 3, pass through (group = 8) | 448 × 448 | PASCAL VOC, iVS Database | 67.03 | 80.8 |
Table 11.
Comparison of computational cost.
Models | Multiply and Accumulate (MAC) (G/frame) | FPS |
---|---|---|
YOLOv2 416 [18] | 16.56 | 68.5 |
YOLOv2 608 [18] | 30.49 | 37.4 |
YOLOv3 [36] | 28.12 | 35.0 |
SSD 300 [17] | 30.53 | 46.0 |
SSD 512 [17] | 87.84 | 19.0 |
Proposed TSBBRs K = 1, pass through (group = 8) | 14.83 | 78.2 |
Proposed TSBBRs K = 3, pass through (group = 8) | 16.89 | 67.3 |
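The MAC figures in Table 11 can be approximated by summing, over all convolution layers, the product of output resolution, output channels, input channels, and kernel area. A minimal sketch of that bookkeeping (bias terms, activations, and pooling ignored), using the first two backbone layers of Table 1 as an example:

```python
def conv_mac(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulate operations of one convolution layer."""
    return h_out * w_out * c_out * c_in * k * k

# First two backbone convolutions of Table 1 at a 224 x 224 input;
# the totals reported in Table 11 also include the TSBBR heads and a larger input.
total = conv_mac(224, 224, 3, 32, 3) + conv_mac(112, 112, 32, 64, 3)
print(f"{total / 1e9:.3f} GMAC per frame")   # ~0.275 GMAC
```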
Table 12.
Comparison of the proposed design to other models on the iVS database.
Models | mAP (%) | Vehicle AP (%) | Bike AP (%) | Pedestrian AP (%) |
---|---|---|---|---|
YOLOv2 416 [18] | 69.6 | 68.2 | 74.1 | 66.6 |
YOLOv2 608 [18] | 77.9 | 80.7 | 73.4 | 79.7 |
YOLOv3 [36] | 84.8 | 86.8 | 80.3 | 87.1 |
SSD 300 [17] | 65.1 | 77.3 | 73.1 | 44.9 |
SSD 512 [17] | 79.2 | 93.8 | 72.5 | 71.3 |
Proposed TSBBRs K = 1, pass through (group = 8) | 79.0 | 86.8 | 78.5 | 71.8 |
Proposed TSBBRs K = 3, pass through (group = 8) | 82.4 | 86.6 | 78.5 | 82.0 |
Table 13.
Experimental results on the Pascal VOC 2007 car dataset.
Methods | Training Data | Average Precision (AP) | Processing Speed (FPS) |
---|---|---|---|
RCNN [30] | VOC07 car trainval | 38.52 % | 0.08 |
Fast RCNN [32] | VOC07 car trainval | 52.95 % | 0.5 |
Faster RCNN [32] | VOC07 car trainval | 59.82 % | 6 |
RV-CNN [6] | VOC07 car trainval | 63.91 % | - |
FVPN [48] | CompCars dataset [52] | 65.12 % | 46 |
Proposed (TSBBRs K = 3, pass through, group = 8) | VOC07 car trainval | 66.83 % | 67.3 |
Proposed (TSBBRs K = 3, pass through, group = 8) | The iVS database + VOC07 car trainval | 86.54 % | 67.3 |
Table 14.
Performance of the proposed design on two portable platforms.
Model | Platform | Frames Per Second (FPS) |
---|---|---|
Proposed model (TSBBRs K = 3, pass through, group = 8) | NVIDIA Drive-PX2 | 19.4 |
Proposed model (TSBBRs K = 3, pass through, group = 8) | NVIDIA Jetson TX2 | 8.9 |
YOLOv3 | NVIDIA Drive-PX2 | 8.5 |
YOLOv3 | NVIDIA Jetson TX2 | 3.2 |