Figure 1.
Classification of commonly used object detection algorithms.
Figure 2.
A small object in an aerial image with a large field of view. The vehicle in this image accounts for only 0.0075% of the image pixels. Although collecting large-field-of-view aerial images makes object search very efficient, it poses a challenge to real-time requirements.
Figure 3.
Wide-area real-time object search system of the UAV, composed of software and various hardware components.
Figure 4.
NVIDIA Jetson AGX Xavier edge device, which was selected as the system’s central controller.
Figure 5.
(a) Sony A7R series cameras, which have a similar appearance; (b) The appearance of a Canon 5DS camera.
Figure 6.
Aerial images cover the ground with some overlap.
Figure 7.
A 50 kg VTOL industrial hybrid UAV that carries the entire real-time object search system.
Figure 8.
Detailed flowchart of the image acquisition subsystem.
Figure 9.
Detailed flowchart of the object detection subsystem.
Figure 10.
The aerial image is first divided into equal-size sub-images with overlap for detection, and the results are merged after all detections are completed.
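The split-and-merge strategy in Figure 10 can be sketched in a few lines. The tile size, overlap, and helper names below are illustrative assumptions rather than the system's exact implementation.

```python
# Illustrative sketch of the split-with-overlap strategy in Figure 10.
# Tile size, overlap, and function names are assumptions, not the authors' code.
from typing import List, Tuple

def tile_origins(img_w: int, img_h: int,
                 tile: int = 800, overlap: int = 100) -> List[Tuple[int, int]]:
    """Top-left corners of equal-size tiles covering the image with overlap."""
    stride = tile - overlap
    xs = list(range(0, max(img_w - tile, 0) + 1, stride))
    ys = list(range(0, max(img_h - tile, 0) + 1, stride))
    # Make sure the right and bottom borders are fully covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y) for y in ys for x in xs]

def merge_detections(dets_per_tile):
    """Shift tile-local boxes back to full-image coordinates.

    `dets_per_tile` is a list of ((x0, y0), [(x1, y1, x2, y2, score), ...]).
    Duplicates in the overlap regions would still need NMS afterwards.
    """
    merged = []
    for (x0, y0), boxes in dets_per_tile:
        for x1, y1, x2, y2, score in boxes:
            merged.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score))
    return merged
```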
Figure 11.
(a) The case in which the pitch, roll, and yaw of the UAV are all zero during flight (practically impossible); the recorded coordinate is then the exact center of the captured image. (b) When the pitch, roll, and yaw are not all zero, the airframe coordinate system does not coincide with the geodetic coordinate system, and the captured image is deformed.
Figure 12.
Schematic diagram of projecting the projection point onto the actual ground position of the suspected object.
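One common way to realize the projection in Figure 12 is to intersect the pixel's viewing ray with the ground plane using the recorded attitude and relative altitude. The sketch below assumes a pinhole camera, an ENU ground frame, and flat terrain; the paper's exact camera and attitude conventions may differ.

```python
# Illustrative flat-ground projection in the spirit of Figure 12.
# Symbols (fx, fy, cx, cy, attitude handling) are assumptions for this sketch.
import numpy as np

def pixel_to_ground(u, v, fx, fy, cx, cy, R_cam_to_world, cam_pos_enu):
    """Intersect the viewing ray of pixel (u, v) with the flat ground plane z = 0.

    R_cam_to_world : 3x3 rotation from the camera frame to a local ENU frame,
                     built from the recorded pitch/roll/yaw.
    cam_pos_enu    : camera position (east, north, up), 'up' = relative altitude.
    """
    cam_pos_enu = np.asarray(cam_pos_enu, dtype=float)
    # Ray direction in the camera frame (camera looks along +z).
    ray_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    ray_world = R_cam_to_world @ ray_cam
    if ray_world[2] >= 0:                 # ray does not point towards the ground
        return None
    t = -cam_pos_enu[2] / ray_world[2]    # scale so that the ray reaches z = 0
    ground = cam_pos_enu + t * ray_world
    return ground[:2]                     # east/north position of the suspected object
```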
Figure 13.
The software system is divided into modules that execute in parallel. As soon as a module finishes and hands its data to the next module, it immediately starts its next loop iteration.
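A minimal way to realize the parallel module execution of Figure 13 is one thread per module connected by bounded queues, so that a module starts its next iteration as soon as it hands data downstream. The stage functions below are placeholders, not the system's actual modules.

```python
# Minimal sketch of the pipelined execution in Figure 13: one thread per module,
# connected by queues. The three stage functions are placeholders.
import queue
import threading
import time

def acquire(i):           # stand-in for the image acquisition module
    time.sleep(0.1); return f"image_{i}"

def detect(img):          # stand-in for the object detection module
    time.sleep(0.1); return f"detections_of_{img}"

def store(det):           # stand-in for the result storage/transmission module
    time.sleep(0.1); print("stored", det)

def stage(work, src, dst):
    """Loop forever: take an item, process it, hand it to the next module."""
    while True:
        item = src.get()
        out = work(item)
        if dst is not None:
            dst.put(out)
        src.task_done()

q_in, q_img, q_det = queue.Queue(), queue.Queue(maxsize=2), queue.Queue(maxsize=2)

for work, src, dst in ((acquire, q_in, q_img), (detect, q_img, q_det), (store, q_det, None)):
    threading.Thread(target=stage, args=(work, src, dst), daemon=True).start()

for i in range(5):        # feed five acquisition triggers
    q_in.put(i)
q_in.join(); q_img.join(); q_det.join()
```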
Figure 14.
Three feature maps are retained in the FPN, and the output channels are changed to 128; the RPN, classification, and regression feature map channels are changed accordingly.
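Assuming an MMDetection-style configuration (an assumption; the framework and exact field values are not stated in the caption), the change in Figure 14 could be expressed as a config fragment such as the following.

```python
# Hypothetical MMDetection-style config fragment in the spirit of Figure 14:
# keep only three FPN outputs and shrink the intermediate channels to 128.
# Field names follow MMDetection conventions; the authors' config may differ.
model = dict(
    backbone=dict(
        type='ResNet', depth=50,
        out_indices=(0, 1, 2)),               # use only the three low-level stages
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024],         # matching the retained backbone stages
        out_channels=128,                     # reduced from the default 256
        num_outs=3),                          # keep only three feature maps
    rpn_head=dict(
        type='RPNHead', in_channels=128, feat_channels=128),
    roi_head=dict(
        bbox_roi_extractor=dict(out_channels=128, featmap_strides=[4, 8, 16]),
        bbox_head=dict(type='Shared2FCBBoxHead', in_channels=128)))
```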
Figure 15.
Schematic diagram of YOLOv5s. The simplification of YOLOv5s mainly consists of removing the detection head with a stride of 32 (P5) and reducing the "width_multiple" and "depth_multiple" parameters.
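The effect of reducing "width_multiple" and "depth_multiple" (Y2 and Y3 in Table 4) follows the scaling rule used by YOLOv5's model parser; the snippet below paraphrases that rule rather than quoting the repository code.

```python
# How "width_multiple" and "depth_multiple" scale YOLOv5s (paraphrased rule).
import math

def scaled_channels(base_channels: int, width_multiple: float) -> int:
    # Channels are scaled and rounded up to a multiple of 8.
    return math.ceil(base_channels * width_multiple / 8) * 8

def scaled_depth(base_repeats: int, depth_multiple: float) -> int:
    # Module repeat counts are scaled but never fall below 1.
    return max(round(base_repeats * depth_multiple), 1)

# Y2: width_multiple 0.50 -> 0.30 shrinks a 512-channel layer from 256 to 160.
print(scaled_channels(512, 0.50), scaled_channels(512, 0.30))   # 256 160
# Y3: depth_multiple 0.33 -> 0.099 reduces a 9-repeat block from 3 to 1.
print(scaled_depth(9, 0.33), scaled_depth(9, 0.099))            # 3 1
```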
Figure 16.
(a) The original network structure; (b) The network structure after reconstruction and optimization by TensorRT.
Figure 17.
Data augmentation for aerial images.
Figure 18.
Example images from the VisDrone2021-DET dataset.
Figure 19.
(a) A single camera installed with its view facing directly below the UAV; (b) Dual cameras installed, tilted 15° to the left and right of the UAV, respectively.
Figure 20.
Schematic diagram of the scanning route.
Figure 21.
Images from the two cameras, with a certain degree of overlap.
Figure 22.
Images containing vehicles that were successfully detected.
Table 1.
Performance comparison of some processors. TFLOPS stands for tera floating point operations per second, and TOPS stands for tera operations per second.
| Device Type | Memory Size | Max Power | AI Performance | TensorRT |
| --- | --- | --- | --- | --- |
| Jetson AGX Xavier | 32 GB | 30 W | 32 TOPS | Support |
| Jetson Xavier NX | 8 GB | 20 W | 21 TOPS | Support |
| Jetson TX2 | 8 GB | 15 W | 1.33 TFLOPS | Support |
| Atlas 500 | 8 GB | 40 W | 22 TOPS | / |
Table 2.
Comparison of working performance of different cameras. The “Capture Interval” represents the average time from when the camera is triggered to when the image is transmitted to the edge device.
| Camera Type | Camera Weight | Megapixels | Image Volume | Capture Interval |
| --- | --- | --- | --- | --- |
| Sony A7R2 | 625 g | 42.4 | 30.0 MB | 3.6 s |
| Sony A7R3 | 657 g | 42.4 | 30.0 MB | 2.4 s |
| Sony A7R4 | 665 g | 61.0 | 47.5 MB | 2.8 s |
| Canon 5DS | 930 g | 50.6 | 35.0 MB | 2.5 s |
Table 3.
Value ranges of different data types.
| Data Type | Number of Bytes | Dynamic Range |
| --- | --- | --- |
| FP32 | 4 | −3.4 × 10³⁸ ~ 3.4 × 10³⁸ |
| FP16 | 2 | −65,504 ~ 65,504 |
| INT8 | 1 | −128 ~ 127 |
Table 4.
Comparison results of various simplification methods for the object detection algorithms. In this table, R50/R18 stands for ResNet50/ResNet18. S1 denotes removing the high-level feature maps of the model, retaining only the three low-level feature maps of the FPN. S2 denotes changing the number of channels of all intermediate layers from 256 to 128. For YOLOv5s, Y1 means that the detection head with a stride of 32 (P5) is not used for detection, Y2 denotes reducing the "width_multiple" from 0.50 to 0.30, and Y3 denotes reducing the "depth_multiple" from 0.33 to 0.099. MobileNetV2 means that the backbone of YOLOv5s is replaced with MobileNetV2. The input size of YOLOv5s is 800, and the input size of the other models is 800 × 600. The FPS calculation includes the time consumed by the preprocessing and postprocessing procedures.
| Model | mAP | mAP Decline | Parameters | FLOPs | FPS |
| --- | --- | --- | --- | --- | --- |
| Faster-RCNN-R50-FPN | 43.3% | / | 41.12 M | 104.59 G | 4.06 |
| Faster-RCNN-R50-FPN (S1) | 42.8% | 0.5% | 40.01 M | 103.70 G | 4.17 |
| Faster-RCNN-R50-FPN (S2) | 43.0% | 0.3% | 31.99 M | 60.89 G | 5.53 |
| Faster-RCNN-R18-FPN | 41.2% | 2.1% | 28.12 M | 79.66 G | 5.18 |
| Cascade-RCNN-R50-FPN | 45.4% | / | 68.93 M | 132.39 G | 3.01 |
| Cascade-RCNN-R50-FPN (S1) | 45.2% | 0.2% | 67.81 M | 131.50 G | 3.09 |
| Cascade-RCNN-R50-FPN (S2) | 45.2% | 0.2% | 46.95 M | 75.85 G | 4.16 |
| Cascade-RCNN-R18-FPN | 44.4% | 1.0% | 55.93 M | 107.46 G | 3.51 |
| RetinaNet-R50-FPN | 35.0% | / | 36.10 M | 96.34 G | 4.66 |
| RetinaNet-R50-FPN (S1) | 34.9% | 0.1% | 32.04 M | 95.50 G | 4.93 |
| RetinaNet-R50-FPN (S2) | 34.2% | 0.8% | 27.92 M | 54.71 G | 5.89 |
| RetinaNet-R18-FPN | 32.1% | 2.9% | 19.61 M | 72.42 G | 6.12 |
| FCOS-R50-FPN | 41.1% | / | 31.84 M | 92.69 G | 4.88 |
| FCOS-R50-FPN (S1) | 40.7% | 0.4% | 30.13 M | 91.56 G | 5.17 |
| FCOS-R50-FPN (S2) | 40.2% | 0.9% | 25.62 M | 51.75 G | 6.39 |
| FCOS-R18-FPN | 37.5% | 3.6% | 18.93 M | 71.47 G | 6.49 |
| ATSS-R50-FPN | 46.2% | / | 31.89 M | 94.92 G | 4.73 |
| ATSS-R50-FPN (S1) | 46.0% | 0.2% | 30.18 M | 93.79 G | 4.97 |
| ATSS-R50-FPN (S2) | 45.7% | 0.5% | 25.67 M | 53.98 G | 5.96 |
| ATSS-R18-FPN | 43.2% | 3.0% | 18.94 M | 71.47 G | 6.16 |
| YOLOv5s | 44.0% | / | 7.05 M | 12.75 G | 11.39 |
| YOLOv5s (Y1) | 43.9% | 0.1% | 5.27 M | 11.64 G | 11.66 |
| YOLOv5s (Y2) | 40.7% | 3.3% | 2.69 M | 5.21 G | 13.76 |
| YOLOv5s (Y3) | 42.8% | 1.2% | 6.64 M | 11.11 G | 13.18 |
| YOLOv5s (MobileNetV2) | 35.2% | 8.8% | 4.54 M | 7.66 G | 13.43 |
Table 5.
The final results of the object detection algorithm simplification. ALL means that the algorithm uses all of the simplification strategies listed in Table 4, and Y1 + Y2 + Y3 means that only the indicated strategies are used.
| Model | mAP | mAP Decline | Parameters | FLOPs | FPS | Improving Rate |
| --- | --- | --- | --- | --- | --- | --- |
| Faster-RCNN-R50-FPN | 43.3% | / | 41.12 M | 104.59 G | 4.06 | 100.0% |
| Faster-RCNN (ALL) | 40.3% | 3.0% | 19.15 M | 37.15 G | 8.12 | |
| Cascade-RCNN-R50-FPN | 45.4% | / | 68.93 M | 132.39 G | 3.01 | 91.0% |
| Cascade-RCNN (ALL) | 43.5% | 1.9% | 34.11 M | 52.11 G | 5.75 | |
| RetinaNet-R50-FPN | 35.0% | / | 36.10 M | 96.34 G | 4.66 | 104.5% |
| RetinaNet (ALL) | 31.3% | 3.7% | 12.89 M | 31.43 G | 9.53 | |
| FCOS-R50-FPN | 41.1% | / | 31.84 M | 92.69 G | 4.88 | 100.8% |
| FCOS (ALL) | 36.4% | 4.7% | 12.69 M | 30.92 G | 9.80 | |
| ATSS-R50-FPN | 46.2% | / | 31.89 M | 94.92 G | 4.73 | 97.2% |
| ATSS (ALL) | 42.6% | 3.6% | 12.70 M | 30.92 G | 9.33 | |
| YOLOv5s | 44.0% | / | 7.05 M | 12.75 G | 11.39 | 37.9% |
| YOLOv5s (Y1 + Y2 + Y3) | 39.9% | 4.1% | 1.85 M | 4.15 G | 15.71 | |
Table 6.
Performance comparison after converting the original models to FP16 or INT8 models using TensorRT. "Cali." indicates that the model was calibrated before INT8 quantization, and "engine" denotes a model file with the .engine suffix.
| Model | Precision | mAP | mAP Decline | Model Volume |
| --- | --- | --- | --- | --- |
| Faster-RCNN (ALL) | Original model | 40.3% | / | 120.4 MB |
| | FP16 | 38.9% | 1.4% | 37.7 MB |
| | INT8 | 37.4% | 2.9% | 24.3 MB |
| | INT8 (Cali.) | 38.7% | 1.6% | 29.5 MB |
| Cascade-RCNN (ALL) | Original model | 43.5% | / | 240.0 MB |
| | FP16 | 40.7% | 2.8% | 82.8 MB |
| | INT8 | 39.4% | 4.1% | 47.4 MB |
| | INT8 (Cali.) | 40.5% | 3.0% | 52.7 MB |
| RetinaNet (ALL) | Original model | 31.3% | / | 70.3 MB |
| | FP16 | 31.1% | 0.2% | 27.9 MB |
| | INT8 | 8.1% | 23.2% | 21.4 MB |
| | INT8 (Cali.) | 31.0% | 0.3% | 21.7 MB |
| FCOS (ALL) | Original model | 36.4% | / | 68.8 MB |
| | FP16 | 36.3% | 0.1% | 28.2 MB |
| | INT8 | 4.5% | 31.9% | 22.2 MB |
| | INT8 (Cali.) | 35.8% | 0.6% | 21.7 MB |
| ATSS (ALL) | Original model | 42.6% | / | 68.8 MB |
| | FP16 | 40.9% | 1.7% | 27.7 MB |
| | INT8 | 37.6% | 5.0% | 21.7 MB |
| | INT8 (Cali.) | 40.7% | 1.9% | 21.3 MB |
| YOLOv5s (Y2 + Y3) | Original model | 39.8% | / | 5.6 MB |
| | FP16 | 38.5% | 1.3% | 6.4 MB (engine) |
| | INT8 (Cali.) | 38.3% | 1.5% | 1.4 MB (engine) |
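The FP16 and INT8 engines compared in Table 6 can be produced through TensorRT's ONNX path. The sketch below uses the TensorRT 8-style Python API; the file names are placeholders, and the INT8 calibrator is only indicated because it has to be implemented against the calibration images.

```python
# Minimal sketch of building a TensorRT engine from an ONNX model (TensorRT 8-style API).
# File names are placeholders; the paper's exact export pipeline is not shown here.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("detector.onnx", "rb") as f:          # placeholder ONNX model
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)           # FP16 build
# For an INT8 build with calibration (as in the "INT8 (Cali.)" rows), one would
# instead set trt.BuilderFlag.INT8 and assign a calibrator subclassing
# trt.IInt8EntropyCalibrator2 that feeds sample aerial images:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyCalibrator(...)    # hypothetical user-defined class

serialized = builder.build_serialized_network(network, config)
with open("detector_fp16.engine", "wb") as f:   # the ".engine" file noted in Table 6
    f.write(serialized)
```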
Table 7.
Comparison of inference speed with and without TensorRT deployment (INT8).
| Model | FPS (Original) | FPS (TensorRT) | Improvement Rate |
| --- | --- | --- | --- |
| Faster-RCNN (ALL) | 8.12 | 18.72 | 130.5% |
| Cascade-RCNN (ALL) | 5.75 | 13.85 | 140.8% |
| RetinaNet (ALL) | 9.53 | 20.20 | 111.9% |
| FCOS (ALL) | 9.80 | 20.46 | 108.7% |
| ATSS (ALL) | 9.33 | 19.91 | 113.3% |
| YOLOv5s (Y2 + Y3) | 15.48 | 27.20 | 75.7% |
Table 8.
Comparisons of the performance of the final model used on the edge device and the benchmark model.
| Benchmark Model | mAP | mAP Final | mAP Decline | FPS | FPS Final | Improvement Rate |
| --- | --- | --- | --- | --- | --- | --- |
| Faster-RCNN-R50-FPN | 43.3% | 38.7% | 4.6% | 4.06 | 18.72 | 361.0% |
| Cascade-RCNN-R50-FPN | 45.4% | 40.5% | 4.9% | 3.01 | 13.85 | 360.1% |
| RetinaNet-R50-FPN | 35.0% | 31.0% | 4.0% | 4.66 | 20.20 | 333.4% |
| FCOS-R50-FPN | 41.1% | 35.8% | 5.3% | 4.88 | 20.46 | 319.2% |
| ATSS-R50-FPN | 46.2% | 40.7% | 5.5% | 4.73 | 19.91 | 320.9% |
| YOLOv5s | 44.0% | 38.3% | 5.3% | 11.39 | 27.20 | 138.8% |
Table 9.
The coverage of a single acquisition, using a Sony A7R2 camera with a 55 mm prime lens.
| Relative Flight Altitude | Width Coverage | Height Coverage | Resolution per Pixel |
| --- | --- | --- | --- |
| 300 m | 195.818 m | 130.909 m | 0.025 m |
| 400 m | 261.091 m | 174.545 m | 0.033 m |
| 500 m | 326.364 m | 218.182 m | 0.041 m |
| 600 m | 391.636 m | 261.818 m | 0.049 m |
| 700 m | 456.909 m | 305.455 m | 0.057 m |
| 800 m | 522.182 m | 349.091 m | 0.066 m |
| 900 m | 587.455 m | 392.727 m | 0.074 m |
| 1000 m | 652.727 m | 436.364 m | 0.082 m |
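The entries of Table 9 follow from the pinhole relation ground size = altitude × sensor size / focal length. The sensor dimensions and pixel count below are the Sony A7R2's nominal full-frame specifications, assumed here rather than quoted from the paper.

```python
# Reproducing the Table 9 entries from the pinhole relation.
# Assumed A7R2 specs: ~35.9 mm x 24 mm sensor, ~7952 pixels across, 55 mm lens.
SENSOR_W_MM, SENSOR_H_MM = 35.9, 24.0
PIXELS_W = 7952
FOCAL_MM = 55.0

def ground_coverage(altitude_m: float):
    width = altitude_m * SENSOR_W_MM / FOCAL_MM
    height = altitude_m * SENSOR_H_MM / FOCAL_MM
    gsd = width / PIXELS_W                      # ground resolution per pixel
    return width, height, gsd

for alt in range(300, 1001, 100):
    w, h, gsd = ground_coverage(alt)
    print(f"{alt} m: {w:.3f} m x {h:.3f} m, {gsd:.3f} m/pixel")
```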
Table 10.
Execution time of each module of the system.
| Module | Average Execution Period |
| --- | --- |
| Image Acquisition (Camera = 1) | 3.8 s |
| Communicate with UAV | Ignored |
| Store Original Image | 0.9 s |
| Object Detection (N = 5) | 2.4 s |
| Store Detect Results | 1.1 s |
| Results Real-time Transmit | Ignored |
| Average sequential execution period | 8.2 s |
| Average parallel execution period | 3.8 s |
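The last two rows of Table 10 follow directly from the module times above: the sequential period is the sum of the non-ignored stages, and the parallel period equals the slowest stage.

```python
# Deriving the summary rows of Table 10 from the per-module times.
stages = {"Image Acquisition": 3.8, "Store Original Image": 0.9,
          "Object Detection": 2.4, "Store Detect Results": 1.1}
print(round(sum(stages.values()), 1))  # 8.2 s, average sequential execution period
print(max(stages.values()))            # 3.8 s, average parallel execution period
```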
Table 11.
The coverage of a single acquisition using two cameras.
| Relative Flight Altitude | Width Coverage | Height Coverage | Resolution per Pixel |
| --- | --- | --- | --- |
| 300 m | 410.451 m | 130.909 m | 0.029 m |
| 400 m | 547.268 m | 174.545 m | 0.039 m |
| 500 m | 684.086 m | 218.182 m | 0.049 m |
| 600 m | 820.902 m | 261.818 m | 0.059 m |
| 700 m | 957.719 m | 305.455 m | 0.069 m |
| 800 m | 1094.537 m | 349.091 m | 0.078 m |
| 900 m | 1231.355 m | 392.727 m | 0.088 m |
| 1000 m | 1368.170 m | 436.364 m | 0.098 m |
Table 12.
Execution cycle of each module of the dual camera system.
| Module | Average Execution Period |
| --- | --- |
| Image Acquisition (Camera = 2) | 4.0 s |
| Communicate with UAV | Ignored |
| Store Original Image | 1.4 s |
| Object Detection (N = 5) | 4.6 s |
| Store Detect Results | 1.7 s |
| Results Real-time Transmit | Ignored |
| Average sequential execution period | 11.7 s |
| Average parallel execution period | 4.6 s |