1. Introduction
Remote sensing image target detection has been extensively applied in various fields, including land resources planning [1], glacier change monitoring [2], disaster prevention and relief [3], military national defense [4], urban safety supervision [5], and forestry vegetation monitoring [6]. It holds significant scientific research value and offers broad application prospects. Remote sensing image object detection methods can be broadly categorized into traditional methods based on handcrafted feature extraction and deep learning-based methods [7]. Traditional methods rely on manually designed feature extractors to perform target detection, whereas deep learning methods employ convolutional neural networks (CNNs) to extract features in a data-driven manner, learning continuously from data. With the continuous advancement and maturation of deep learning theory and technology, deep learning-based object detection has surpassed traditional methods by a significant margin and has gained broad application in domains such as image classification and object detection. Deep learning-based general object detection algorithms include anchor-based methods, with prominent examples being one-stage detectors such as SSD [8] and the YOLO series [9], and two-stage detectors such as Faster R-CNN [10] and Mask R-CNN [11]. Each of these algorithms has its own strengths in practical engineering applications. A two-stage detector divides the detection process into two stages: it first generates candidate regions, then refines their positions before classifying them. Its advantages are low detection and recognition error rates and a low false positive rate; its drawback is slower detection speed, which makes it less suitable for real-time object detection in videos or images. Two-stage detectors are therefore commonly used in applications requiring high-precision detection, such as face recognition and medical image analysis. In contrast, a one-stage detector has no separate stage for generating candidate regions; it directly predicts class probabilities and position coordinates in a single pass, resulting in faster detection than the two-stage approach. One-stage detectors are therefore commonly used in scenarios requiring fast processing, such as real-time video monitoring and autonomous driving. In recent years, there have also been significant advancements in anchor-free object detection algorithms, such as CenterNet [12] and FCOS (fully convolutional one-stage object detection) [13], which eliminate the dependence on predefined anchors.
Target detection in remote sensing images has garnered significant attention in the field. The advancement of remote sensing satellite technology in recent years has spurred research into spaceborne remote sensing image target detection systems. When applying such detection in practical engineering, it is crucial to consider not only detection accuracy but also the real-time capability of the network model and the power consumption of the deployment platform. To illustrate, small remote sensing satellites such as CubeSats typically operate within a power budget of only 2–8 W [14]. In contrast, GPU platforms such as the A100, RTX 2080Ti, RTX 3090, and Tesla V100 have power consumption levels that far exceed the carrying capacity of such a satellite, which makes deploying these GPU platforms directly on board infeasible. As a result, the conventional approach in satellite remote sensing image processing is to transmit the captured images from satellites or aerial drones to GPU platforms on the ground, leveraging readily available high-performance computing resources while staying within the power constraints of the satellite. However, downloading image data to the ground introduces significant delays in data processing, reducing system efficiency and making real-time detection requirements difficult to meet [15]. To address this issue, researchers have begun exploring the deployment of convolutional neural network (CNN) models on application-specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs) [16]. By leveraging the parallel computing capabilities of these specialized hardware platforms, the latency associated with transmitting data to the ground can be avoided, enabling efficient real-time detection that meets the stringent requirements of remote sensing applications. Among them, FPGAs have emerged as an alternative to GPUs in spaceborne scenarios thanks to their low power consumption, high performance, parallel computing capabilities, programmability, and customization, and they offer higher flexibility and lower cost than ASICs [17].
When deploying CNN models on embedded devices, it is common to trade some accuracy for faster detection and fewer model parameters. Commonly used methods include model compression [18] and the adoption of more lightweight network models [19] to decrease the model size. These techniques enable the deployment of CNN models on embedded devices with reduced computational complexity and memory footprint while still maintaining acceptable performance.
To cater to the real-time target detection needs of remote sensing images in spaceborne applications, while addressing challenges related to deep learning model parameters, computational complexity, and hardware platform deployment, this paper presents an efficient lightweight algorithm called YOLOv4-MobileNetv3. Additionally, channel pruning is performed on this algorithm. Furthermore, the Xilinx Vitis AI tool chain is employed to quantize, compile, and deploy the pruned network model at the edge. The primary contributions of this paper can be summarized as follows:
Significant Reduction in Model Parameters: The acceleration strategy we proposed achieves a remarkable reduction in the number of model parameters, with the parameter size being only 0.09 times that of the original YOLOv4. This reduction not only enhances the efficiency of model storage but also contributes to faster inference times.
Proposed Deployment Scheme: We have developed a novel deployment scheme for YOLOv4-MobileNetv3 on an FPGA using channel layer pruning. The proposed deployment scheme addresses the challenges of high power consumption and real-time performance in hardware deployment platforms for object detection, enabling efficient implementation on resource-constrained hardware.
Superiority in Energy Efficiency and Real-Time Processing: Our lightweight algorithm model outperforms CPUs and GPUs in terms of both energy efficiency and real-time performance. This makes our proposed solution highly suitable for spaceborne remote sensing image processing systems, where the efficient utilization of resources is crucial.
The remaining sections of this paper are structured as follows:
Section 2 provides an overview of related work, including common lightweight network model structures, compression techniques for network models, and FPGA deployment of CNN models.
Section 3 presents the experimental design of this study. It begins by introducing the data augmentation techniques, then describes the improved YOLOv4-MobileNetv3 network model structure in detail, and then presents the sparse training and channel pruning experiments for YOLOv4-MobileNetv3. Finally, it covers the process of using Vitis AI to quantize the pruned network model and compile it for deployment on FPGA platforms.
Section 4 outlines the experimental setup and analyzes the results. It evaluates and compares the performance of the improved network model against the mainstream network model. Additionally, it analyzes and compares the performance of the network model deployed on different hardware platforms. Lastly,
Section 5 provides a summary of the entire paper. It discusses the key findings and future research directions in this field.
3. Methodology
3.1. GridMask-Mosaic Data Augmentation
Data augmentation is a common method in the field of image processing for improving the robustness of network models. Common operations include cropping, rotating, adjusting brightness, flipping, and masking the image. Compared with YOLOv3, YOLOv4 uses the Mosaic algorithm for data augmentation, which helps the model learn targets from different angles and locations, thereby improving the generalization ability and robustness of the network model [47]. Mosaic data augmentation is an improved version of CutMix, and the two are theoretically similar. CutMix splices two images together: a random region is selected from one image, cut out, and pasted onto another image to create a new training sample, which helps the model learn to handle boundaries and contextual information between different objects. Mosaic data augmentation, on the other hand, randomly arranges and recombines four input images by applying random zooming, random cropping, random flipping, and color gamut changes to generate a new composite image. This provides diverse and varied training samples, aiding the model in adapting to various complex scenes and object combinations. The effect of Mosaic data augmentation is shown in Figure 3.
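To make the four-image recombination concrete, the following is a minimal NumPy sketch of Mosaic-style augmentation. The function name, the 416 × 416 output size, the gray fill value, and the nearest-neighbor resizing are our illustrative assumptions rather than the exact YOLOv4 implementation, and bounding-box handling is omitted.

```python
import random
import numpy as np

def mosaic(images, out_size=416):
    """Combine four images into one Mosaic sample (label handling omitted).

    images: list of four HxWx3 uint8 arrays.
    """
    assert len(images) == 4
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    # Random mosaic center, kept away from the image borders.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    # Four target regions: top-left, top-right, bottom-left, bottom-right.
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # Nearest-neighbor "resize" of the source image into the region.
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        if random.random() < 0.5:  # random horizontal flip
            xs = xs[::-1]
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas
```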
In practical applications, due to complex shooting environments, it is inevitable that some remote sensing images will be occluded. Existing datasets do not contain a sufficient number of occlusion scenes, and the data augmentation techniques built into the YOLOv4 network model do not cover occlusion. To solve this problem, we use the GridMask algorithm to randomly erase targets in the training portion of the DIOR dataset, thereby improving the robustness of the network model.
GridMask belongs to the information-deletion category of data augmentation algorithms and pays special attention to how much image information is deleted. Deleting too much information can leave the remaining image unable to correctly express the target, turning the sample into noisy data, while deleting too little leaves the target essentially unaffected and cannot improve the robustness of the model [48]. Building on earlier information-deletion algorithms, GridMask deletes image information using uniformly distributed square regions, so that no large continuous area of the image is removed. The augmentation effect of GridMask combined with Mosaic is shown in Figure 4.
For the GridMask data augmentation algorithm, the output image is set to $\tilde{x} = x \times M$, in which $x$ represents the input image and $M \in \{0,1\}^{H \times W}$ represents the binary mask: where $M$ is 1, the original image information is retained; otherwise it is deleted. Here $\tilde{x}$ represents the output image after processing by the GridMask (GM) algorithm. The parameters $(r, d, \delta_x, \delta_y)$ represent $M$, and $M$ is unique given them, where $r$ is the ratio of retained input image information, $d$ determines the size of a deleted region, and $\delta_x$ and $\delta_y$ are the distances from the first complete unit to the image boundary.

Since $r$ represents the proportion of retained information in the input image, we define the keep ratio $t$ of $M$ as shown in Equation (1):

$$ t = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} M_{ij} \quad (1) $$

where $W$ and $H$ refer to the width and height of the input image, respectively. If $t$ is too small, the image loses too much information and the remaining area becomes noisy data, which defeats the purpose of improving the robustness of the model. The relationship between $r$ and $t$ is shown in Equation (2):

$$ t = 1 - (1 - r)^2 \quad (2) $$

The parameter $d$ determines the size of a single deleted region; when $r$ remains unchanged, the relationship between the side length $l$ of a single deleted region and $d$ is shown in Equation (3):

$$ l = (1 - r) \times d \quad (3) $$

The larger $d$ is, the larger $l$ is, and during training the proportions are kept unchanged. In addition, randomness is added to expand the diversity of the images, and the value of $d$ is drawn as shown in Equation (4):

$$ d = \mathrm{random}(d_{min}, d_{max}) \quad (4) $$

When $d$ is small, the probability of experimental failure is reduced, but when it is too small the experimental results become poor and the robustness of the model cannot be effectively improved. Given $r$ and $d$, the mask can be shifted by moving $\delta_x$ and $\delta_y$, whose values are drawn as shown in Equations (5) and (6), respectively:

$$ \delta_x = \mathrm{random}(0, d - 1) \quad (5) $$

$$ \delta_y = \mathrm{random}(0, d - 1) \quad (6) $$
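To make the construction of $M$ concrete, the following is a minimal NumPy sketch of GridMask mask generation following Equations (3)–(6); the function name, default keep ratio, and the sampling range for $d$ are illustrative assumptions, not the exact implementation used in our experiments.

```python
import random
import numpy as np

def gridmask(image, r=0.6, d_min=32, d_max=96):
    """Apply a GridMask-style mask to an HxWx3 image (a sketch, not the paper's code)."""
    h, w = image.shape[:2]
    d = random.randint(d_min, d_max)       # Equation (4): unit size
    l = int((1 - r) * d)                   # Equation (3): side of each deleted square
    dx = random.randint(0, d - 1)          # Equation (5)
    dy = random.randint(0, d - 1)          # Equation (6)
    mask = np.ones((h, w), dtype=np.uint8)  # 1 = keep, 0 = delete
    for y in range(dy - d, h, d):          # tile the deleted squares every d pixels
        for x in range(dx - d, w, d):
            y1, y2 = max(y, 0), max(min(y + l, h), 0)
            x1, x2 = max(x, 0), max(min(x + l, w), 0)
            mask[y1:y2, x1:x2] = 0
    return image * mask[:, :, None]        # x_tilde = x * M
```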
3.2. YOLOv4-MobileNetv3 Framework
Through the analysis of the network structure of YOLOv4, it can be seen that using CSPDarkNet53 as the backbone extraction network can achieve better feature extraction capabilities, but the network model still has the problem of a large number of parameters and high computational complexity, which will consume a lot of computing resources and is not suitable for deployment to embedded devices and FPGAs. We propose the YOLOv4-MobileNetv3 network framework to solve the problem of large model parameters and high computational complexity. The YOLOv4-MobileNetv3 network framework is shown in
Figure 5. Adopting the lightweight network MobileNetv3 as the backbone extraction network, depthwise separable convolutions are used to improve the network convolution layers and reduce the number of parameters. Compared to other lightweight networks such as GhostNet, MobileNetv3 achieves higher performance while remaining lightweight through effective design strategies and an improved structure. It excels in accuracy and is particularly suitable for tasks that require higher precision. By using lightweight components and operations, MobileNetv3 effectively reduces the number of parameters and the computational complexity, resulting in better efficiency in resource-constrained environments, which makes it well suited for embedded devices and mobile applications. In order to make the DPU support the neural network operators of the model, we also use the LeakyReLU function as the activation function, shown in Equation (7):

$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \quad (7) $$

LeakyReLU is a commonly used activation function in neural networks that enhances network robustness. Unlike the traditional ReLU activation function, LeakyReLU returns a small slope on its negative input range instead of simply returning 0. The parameter $\alpha$ is a constant smaller than 1, typically set to 0.01. This function increases the non-linear characteristics of the network and prevents dead neurons in the output layer; it can also mitigate the problem of vanishing gradients during training, thereby improving the training performance of the neural network. The pooling kernel sizes of the SPP module in the original YOLOv4 are 5, 9, and 13, but due to compatibility constraints of the KV260 DPU operators, maxpool kernels are only supported in sizes from 2 × 2 to 8 × 8. We therefore modified the pooling sizes of the SPP module to 3, 5, and 7. Changing the pooling sizes of the original SPP module can cause information loss and a decrease in spatial resolution, which may impact detection performance for certain tasks. However, reducing the pooling sizes also decreases the computational requirements to some extent, thereby improving the inference speed and efficiency of the model.
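A minimal PyTorch sketch of the modified SPP block with the DPU-compatible 3/5/7 max-pooling kernels is shown below; the class name and the 512-channel input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP block with DPU-compatible pooling kernels (3, 5, 7 instead of 5, 9, 13)."""
    def __init__(self, pool_sizes=(3, 5, 7)):
        super().__init__()
        # Stride-1 max pools with 'same' padding keep the spatial size unchanged.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x):
        # Concatenate the input with its multi-scale pooled versions along channels.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)
print(SPP()(x).shape)  # torch.Size([1, 2048, 13, 13])
```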
MobileNetv3 continues the linear bottlenecks and inverted residuals of MobileNetv2, introduces depthwise separable convolutions, uses the h-swish activation function instead of the swish function to reduce the amount of computation, and adds the SE attention mechanism module to improve the detection accuracy of the network model.
After modifying the feature extraction network and activation function, the model is trained in the same way as the standard YOLOv4. Experiments show that the improved, lightweight YOLOv4-MobileNetv3 model has 11.47 M parameters, 82.09% fewer than the original YOLOv4 network model.
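As an illustration of why depthwise separable convolutions cut the parameter count, here is a minimal PyTorch sketch (our own example, not the exact block used in the network): a k × k standard convolution needs k²·C_in·C_out weights, while its depthwise separable counterpart needs only k²·C_in + C_in·C_out.

```python
import torch.nn as nn

def count(m):
    return sum(p.numel() for p in m.parameters())

cin, cout, k = 128, 256, 3
standard = nn.Conv2d(cin, cout, k, padding=1, bias=False)
separable = nn.Sequential(
    # Depthwise: one k x k filter per input channel (groups=cin).
    nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False),
    # Pointwise: 1 x 1 convolution mixes the channels.
    nn.Conv2d(cin, cout, 1, bias=False),
)
print(count(standard), count(separable))  # 294912 vs. 33920, roughly 8.7x fewer
```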
3.3. YOLOv4-MobileNetv3 Model Channel Pruning
The inference stage of a network model requires a large amount of memory bandwidth, and the YOLOv4-MobileNetv3 model still contains redundant weights and activation values after lightweight processing, which leads to inefficient computation during inference. In addition, due to the limited instruction sets and basic operations of hardware devices, many novel modules or network layers fail to compile, which greatly limits the deployment of lightweight networks on embedded devices and FPGAs [49]. In view of this, channel pruning is used to further compress the YOLOv4-MobileNetv3 network model.

Compared to pruning methods that remove individual neurons, channel pruning does not introduce sparsity into the initial CNN model architecture and thus requires no special hardware or software to implement [50]. This makes channel pruning extremely versatile, applicable to almost all model inference platforms, and suitable for hardware acceleration [51]. In channel pruning, training of the YOLOv4-MobileNetv3 network is accelerated by the batch normalization (BN) layers between adjacent convolutional layers, whose normalization operation is shown in Equation (8):
$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta \quad (8) $$

where $\mu$ represents the mean of the input feature $x$, $\sigma^2$ represents the variance, $\beta$ represents the bias, and $\gamma$ represents the trainable scale factor. During training, $\gamma$ is used to distinguish between important and non-important channels of the YOLOv4-MobileNetv3 network model. After selecting the pruning rate, non-critical channels are pruned, as shown in Figure 6. Channel sparsity training is performed by applying L1 regularization to $\gamma$; the sparsity training loss $L$ is shown in Equations (9) and (10):

$$ L = \sum_{(x,y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma) \quad (9) $$

$$ g(\gamma) = \|\gamma\|_1 \quad (10) $$

The first term of Equation (9) represents the loss of YOLOv4-MobileNetv3; $\lambda$ represents the penalty factor; $g(\gamma)$ represents the $L_1$ norm of $\gamma$, with $\Gamma$ denoting the set of all BN scale factors. Equation (11) is obtained by expanding Equation (9) at $\gamma = 0$ to second order, where $H$ denotes the Hessian matrix:

$$ L(\gamma) = L(0) + \nabla L(0)^{T} \gamma + \frac{1}{2} \gamma^{T} H \gamma + \lambda \|\gamma\|_1 \quad (11) $$

Since the trainable scale factors $\gamma_i$ are independent of each other, $H$ becomes a diagonal matrix, as shown in Equation (12):

$$ H = \mathrm{diag}(h_{11}, h_{22}, \dots, h_{nn}) \quad (12) $$

Then, the original formula can be expanded into Equation (13):

$$ L(\gamma) = L(0) + \sum_{i} \left( \nabla_i L(0)\, \gamma_i + \frac{1}{2} h_{ii} \gamma_i^{2} + \lambda |\gamma_i| \right) \quad (13) $$

Combined with the assumption of mutual independence, each $\gamma_i$ can be optimized separately, giving Equation (14):

$$ \gamma_i^{*} = \arg\min_{\gamma_i} \left( \nabla_i L(0)\, \gamma_i + \frac{1}{2} h_{ii} \gamma_i^{2} + \lambda |\gamma_i| \right) \quad (14) $$

Equation (15) is obtained by differentiating the above equation. For the $\mathrm{sgn}$ function, if the input is greater than 0, $\mathrm{sgn}$ returns 1; if it is equal to 0, it returns 0; if it is less than 0, it returns $-1$:

$$ \frac{\partial L}{\partial \gamma_i} = \nabla_i L(0) + h_{ii} \gamma_i + \lambda\, \mathrm{sgn}(\gamma_i) \quad (15) $$

In summary, the closed-form solution $\gamma_i^{*}$ can be obtained, as shown in Equation (16):

$$ \gamma_i^{*} = -\frac{\mathrm{sgn}\big(\nabla_i L(0)\big)\,\max\big(|\nabla_i L(0)| - \lambda,\; 0\big)}{h_{ii}} \quad (16) $$
The scale factors of each BN layer of the YOLOv4-MobileNetv3 network before and after pruning are shown in Figure 7 and Figure 8. It can be seen that after sparse training is completed, the $\gamma$ coefficients are concentrated around 0, which means that the sparse training is saturated. The output of a channel whose scale factor is close to 0 approximates a constant and can be pruned.
After 300 epochs of sparse training, channel pruning was performed on the sparse YOLOv4-MobileNetv3 network model using a pruning rate of 0.85. The experimental results show that the pruned model has 9.42 M parameters, a reduction of 2.05 M.
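For reference, here is a minimal sketch of how a global pruning threshold can be derived from the chosen pruning rate and used to select channels; it illustrates the procedure rather than reproducing the exact script used in this work.

```python
import torch
import torch.nn as nn

def channel_masks(model, prune_rate=0.85):
    """Return a keep/prune mask per BN layer from a global gamma threshold."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    # Threshold chosen so that `prune_rate` of all gammas fall below it.
    thresh = torch.quantile(gammas, prune_rate)
    return {name: (m.weight.data.abs() > thresh)
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
```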
3.4. Vitis AI Deploys CNN
This paper adopts the Xilinx Zynq UltraScale+ MPSoC (AMD Xilinx, San Jose, CA, USA), a software and hardware collaborative acceleration platform combining an FPGA with an ARM processor, to realize the edge deployment of YOLOv4-MobileNetv3. The UltraScale+ MPSoC is composed of two core parts: the programmable logic (PL) and the processing system (PS). It compensates for the ARM processor's lack of real-time computing capability and for the FPGA's limited ability to implement complex algorithms. The UltraScale+ MPSoC adopts TSMC's 16 nm FinFET process node, which delivers high performance while greatly reducing power consumption.
3.4.1. YOLOv4-MobileNetv3 Model Quantization
It is important to note that quantization methods also need to be used in conjunction with specific hardware platforms. For example, NVIDIA uses TensorRT to quantize network models and deploy them on Nano, TX1, and TX2 devices. Since our FPGA evaluation board belongs to the Zynq UltraScale+ MPSoC series, we choose the Vitis AI quantizer to quantize the model. The Vitis AI quantizer can reduce the computational complexity of the network model with little loss of inference accuracy. The deep learning frameworks currently supported by the Vitis AI quantizer are PyTorch and TensorFlow, with the quantizers named vai_q_pytorch and vai_q_tensorflow, respectively. We use Int8 fixed-point quantization to optimize the algorithm model, converting 32-bit floating-point numbers into an 8-bit fixed-point format. The quantization procedure is shown in Algorithm 1. For PyTorch, the Vitis AI quantizer supports two different methods, post-training quantization (PTQ) and quantization-aware training (QAT). This paper selects the PTQ scheme to quantize the YOLOv4-MobileNetv3 model. PTQ is a technique that converts a pretrained floating-point model into a quantized model, characterized by extremely low loss of model accuracy. The number of calibration images generally needs to be determined according to the scene: the more detection categories and the more complex the scene, the more images are required for quantization. Given the many detection scenes in the DIOR dataset, this experiment uses 4000 images as a representative calibration dataset, running several batches of inference on the floating-point model to obtain the distribution of activations.
Algorithm 1: Model Quantization.
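A minimal sketch of the PTQ calibration flow with vai_q_pytorch is shown below, assuming a standard Vitis AI 2.5 installation; the model loader, calibration loader, and evaluation function names are placeholders of ours.

```python
import torch
from pytorch_nndct.apis import torch_quantizer

float_model = load_yolov4_mobilenetv3()  # placeholder: the pruned float model
dummy_input = torch.randn(1, 3, 416, 416)

# Step 1: calibration pass collects activation statistics (PTQ).
quantizer = torch_quantizer("calib", float_model, (dummy_input,),
                            output_dir="quantize_result")
quant_model = quantizer.quant_model
for images, _ in calib_loader:           # placeholder: ~4000 calibration images
    quant_model(images)
quantizer.export_quant_config()

# Step 2: re-run in "test" mode to check Int8 accuracy and export the xmodel.
quantizer = torch_quantizer("test", float_model, (dummy_input,),
                            output_dir="quantize_result")
quant_model = quantizer.quant_model
evaluate(quant_model)                    # placeholder: mAP evaluation
quantizer.export_xmodel(output_dir="quantize_result", deploy_check=False)
```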
3.4.2. YOLOv4-MobileNetv3 Model Compilation and Deployment
The quantized network model needs to be compiled and mapped into a highly optimized DPU instruction stream before the DPU core can be scheduled on the FPGA to achieve hardware acceleration. The compiler in the Vitis AI suite, which includes a parser, an optimizer, and a code generator, performs the compilation of the quantized network. After the Vitis AI compiler (VAI_C) analyzes the topology of the optimized and quantized input model, it builds an internal computation graph as an intermediate representation (IR) and constructs the control flow and data flow on this basis. It then compiles and optimizes operations based on the computation graph. Finally, the code generator maps the optimized computation graph to the DPU instruction stream.
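For reference, the compilation step can be driven from Python as in the sketch below; the vai_c_xir flags follow the Vitis AI documentation, while the file paths and network name are illustrative assumptions.

```python
import subprocess

# Map the quantized xmodel to DPU instructions for the KV260's DPU architecture.
subprocess.run([
    "vai_c_xir",
    "--xmodel", "quantize_result/YOLOv4MobileNetv3_int.xmodel",  # quantizer output
    "--arch", "/opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json",
    "--net_name", "yolov4_mobilenetv3",
    "--output_dir", "compiled",
], check=True)
```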
3.4.3. DPU Core Configuration
A deep learning processing unit (DPU) is a programmable accelerator optimized for neural network models. It consists of a set of parameterizable IP cores that can be configured without manual placement and routing. The DPU supports most deep learning operations, such as convolution, pooling, fully connected layers, and activation functions. Vitis AI provides a series of DPUs for Xilinx's Kria KV260, Versal cards, Alveo cards, Zynq UltraScale+ MPSoC, and other embedded devices. The parameters of the DPU can be configured according to the application to improve throughput, latency, and scalability, offering distinctive flexibility in terms of power consumption. The DPU architecture parameters are shown in Table 1. Considering the logic resources of the KV260 and the parameters of the network model, we choose B4096 as the DPU architecture. The B4096 architecture provides high computational power and processing speed, making it suitable for handling complex tasks and large-scale data. Our KV260 deployment faces high demands for computational resources, requiring the processing of massive amounts of data and complex calculations, and the high-performance computing capability of the B4096 makes it an ideal choice to meet these requirements. Additionally, the B4096 architecture has been widely applied and validated, ensuring its stability and reliability; choosing a proven architecture helps mitigate risks during development and deployment and ensures system stability and consistency.
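Once the DPU is configured and the compiled xmodel is loaded, inference is executed through the VART (Vitis AI Runtime) Python API; a minimal sketch follows, with the file name as a placeholder and pre/post-processing omitted.

```python
import numpy as np
import vart
import xir

# Load the compiled model and pick the DPU subgraph.
graph = xir.Graph.deserialize("compiled/yolov4_mobilenetv3.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu = [s for s in subgraphs
       if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu, "run")

# Allocate host buffers matching the DPU tensor shapes (batch of 1).
in_t = runner.get_input_tensors()[0]
out_ts = runner.get_output_tensors()
inp = np.zeros(tuple(in_t.dims), dtype=np.int8)   # preprocessed, quantized image
outs = [np.zeros(tuple(t.dims), dtype=np.int8) for t in out_ts]

job = runner.execute_async([inp], outs)           # run inference on the DPU
runner.wait(job)                                  # outs now hold the raw YOLO heads
```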
4. Experiments and Evaluation
This section will describe the experimental details. By comparing and analyzing the performance of different network models, we can evaluate the superiority of our design scheme. At the same time, through the comparison of power consumption of different hardware platforms, it is shown that our design scheme has significant advantages compared with GPU and CPU in the application scenario of spaceborne remote sensing images.
4.1. Dataset Description
We use the optical remote sensing image dataset DIOR, created by Li et al. of Northwestern Polytechnical University, to evaluate the performance of the algorithm model [52]. The DIOR dataset is a large-scale optical remote sensing image dataset comprising 23,463 images with a size of 800 × 800 pixels and spatial resolutions ranging from 0.5 m to 30 m. There are 192,472 instances in total, covering 20 object categories. The images in the DIOR dataset span several years, capturing a wide range of imaging conditions, weather variations, and seasons; this diversity ensures that the dataset provides a representative collection of remote sensing data. In our experiments, 80% of the images in the DIOR dataset are selected as the training set, along with 10% for the validation set and 10% for the test set. Figure 9 shows some example images from the DIOR dataset. Samples of golffield, vehicle, trainstation, chimney, groundtrackfield, airplane, stadium, tenniscourt, storagetank, ship, windmill, and airport are shown in (a) to (l), respectively.
4.2. Experiment Environment
The deep learning framework used in this paper is PyTorch 1.8 on Ubuntu 18.04; the Python version is 3.8, the integrated development environment is PyCharm Professional, and the Vitis AI version is 2.5. Two NVIDIA A100 GPUs are used for training and pruning the network model. The comparison hardware platforms are an AMD R7-4800H CPU, an NVIDIA RTX 2060 (6 GB) GPU, and a Xilinx KV260 FPGA. During training and pruning, the batch size is set to 32, the input image size is 416 × 416, the initial learning rate is 0.001, the optimizer is SGD, and a cosine annealing strategy is used to dynamically adjust the learning rate.
4.3. Evaluation Indicators
In this paper, the mean average precision (mAP) is used to evaluate the target detection effect; the parameter count and FPS are used to evaluate the detection speed; and, since the power consumption of the deployment platform is extremely important in spaceborne scenarios, a power consumption index is used to evaluate the processing performance of the hardware deployment platform. The relevant formulas are as follows.

In Equation (17), $x$ represents the number of detected pictures and $T$ is the time consumed to detect the $x$ pictures; FPS represents the number of pictures detected per unit time, and the higher the FPS, the faster the detection speed and the better the real-time performance:

$$ \mathrm{FPS} = \frac{x}{T} \quad (17) $$

In Equations (18) and (19), $FP$ denotes false positive samples, i.e., detections whose predicted class is inconsistent with the true class; $TP$ denotes true positive samples, where the detected class matches the true class; and $FN$ denotes false negative samples, i.e., targets that actually exist but were not detected:

$$ P = \frac{TP}{TP + FP} \quad (18) $$

$$ R = \frac{TP}{TP + FN} \quad (19) $$

For each class detected by the model, the precision–recall pairs form a curve, and the area enclosed by this curve and the coordinate axes is the average precision ($AP$), as expressed in Equation (20). The $mAP$ in Equation (21) is the mean of the $AP$ over all classes, where $N$ represents the number of detection categories:

$$ AP = \int_{0}^{1} P(R)\, dR \quad (20) $$

$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (21) $$
Xilinx provides the Maxim PowerTool GUI and UART software (AMD Xilinx, San Jose, CA, USA, version 2.32.03) for developers to perform power testing of FPGAs. In this paper, the Maxim PowerTool USB cable with its on-board chip and the corresponding GUI software are used to measure the real-time power of the FPGA. Although the Maxim PowerTool can display real-time current and voltage levels, it cannot save historical data, so we wrote a Python program to read out the measured values in real time and compute the average power from readings taken over a continuous period. The average power is a fundamental quantity describing the rate of energy consumption of a system and is expressed as Equation (22), where $T$ is the total measurement time and $p(t)$ represents the instantaneous power as a function of time:

$$ \bar{P} = \frac{1}{T} \int_{0}^{T} p(t)\, dt \quad (22) $$
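A minimal sketch of such a sampling loop is shown below; the read_power() helper standing in for the PowerTool readout is a hypothetical placeholder, since the actual register access depends on the Maxim tooling.

```python
import time

def average_power(read_power, duration_s=60.0, interval_s=0.1):
    """Numerically approximate Equation (22): the mean of p(t) over [0, T].

    read_power: callable returning the instantaneous power in watts
                (hypothetical stand-in for the Maxim PowerTool readout).
    """
    samples = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        samples.append(read_power())
        time.sleep(interval_s)
    return sum(samples) / len(samples)
```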
4.4. Ablation Experiment
In order to verify the effect of each algorithm model optimization, ablation experiments were conducted on the feature extraction network, convolution module, data augmentation, channel pruning, and quantization, and the results are shown in Figure 10. In Figure 10, A represents the original YOLOv4 network model, with 64.04 M parameters and an mAP of 84.47%. B represents the model using the MobileNetv3 feature extraction network without depthwise separable convolution modules, with 39.99 M parameters and an mAP of 83.69%. C represents the model using the MobileNetv3 feature extraction network with depthwise separable convolution modules, with 11.47 M parameters and an mAP of 81.54%. D represents the model after data augmentation using the GridMask and Mosaic algorithms, with 11.47 M parameters and an mAP of 82.24%. E represents the model after channel-level pruning, with 9.42 M parameters and an mAP of 83.03%. F represents the model after Vitis AI quantization, with 5.69 M parameters and an mAP of 82.61%.
From the data in the figure, it is evident that after replacing the feature extraction network and incorporating depthwise separable convolution operations, the model accuracy decreased by 2.93%, while the number of model parameters was reduced by a substantial 52.57 M. This is because the lightweight feature extraction network and depthwise separable convolutions sacrifice some feature representation capability compared to the original network structure, resulting in lower accuracy; in exchange, the large reduction in parameters improves efficiency under limited computing resources. Following the data augmentation operation, the accuracy of the algorithm improved by 0.7%, indicating that data augmentation has a positive effect on model accuracy. Channel pruning reduces the number of parameters by identifying and removing redundant channels in the network; after pruning, the removed channels no longer contribute to the forward and backward passes, reducing computational and storage requirements. Applying channel pruning and fine-tuning removes an additional 2.05 M parameters, making the model more lightweight, while the fine-tuning and retraining process further optimizes the model, yielding an additional 0.79% accuracy gain. Channel pruning and fine-tuning thus effectively compress the model, reduce parameters, and enhance accuracy through retraining, striking a balance between reduced complexity and resource usage on one hand and high performance on the other. Finally, after model quantization, the parameter size decreased by 39.6%, while the accuracy decreased by 0.42%. In conclusion, the model improvement scheme presented in this paper reduces the model parameters to 5.69 M at the expense of 1.86% mAP, a 91.11% reduction compared to the original YOLOv4 model. The individual improvement schemes complement each other and together deliver notable results in terms of model parameter count and detection accuracy.
4.5. Comparative Experiments
Table 2 compares the pruned lightweight model proposed in this paper with the current mainstream target detection models Faster-RCNN, SSD, Centernet, YOLOv3, and YOLOv4. From the data in Table 2, we can see that the mAP of the YOLOv4-MobileNetv3 model is 83.03%: 14.28% higher than Faster-RCNN, 4.85% higher than SSD, 4.05% higher than Centernet, 2.93% higher than YOLOv3, and 1.44% lower than YOLOv4. The parameter size is 9.42 M, which is 93.13% less than Faster-RCNN, 63.98% less than SSD, 71.17% less than Centernet, 84.72% less than YOLOv3, and 85.29% less than YOLOv4. Compared to Faster-RCNN, SSD, Centernet, and YOLOv3, our improved model achieves the best mAP with the fewest parameters. The accuracy of the pruned and fine-tuned YOLOv4-MobileNetv3 model remains lower than that of the original YOLOv4 network. This disparity can be attributed to the weaker feature expression capability of the lightweight feature extraction network: the model struggles to capture the abundant information and target features in the input image, leading to a decrease in detection accuracy. We accepted this compromise, sacrificing 1.44% mAP in order to reduce the model parameters to only 0.15 times the size of the original YOLOv4 network. This reduction in accuracy is within acceptable limits, and our main focus lies in achieving optimal performance on the FPGA platform; this strategic decision is significant for our specific application requirements in terms of computational efficiency and resource utilization.
A comparison of detection results on sample images from the DIOR dataset between the pruned YOLOv4-MobileNetv3 framework proposed in this paper and mainstream target detection algorithms is shown in Figure 11.
4.6. Hardware Acceleration Platform Comparison
In order to verify the superiority of the proposed deployment scheme, mainstream general-purpose deep learning processors, a CPU and a GPU, were selected for experimental comparison of inference performance. The CPU is an AMD R7-4800H and the GPU is an NVIDIA RTX 2060 (6 GB); Table 3 compares the performance test data of the FPGA, GPU, and CPU. As can be seen from the data in Table 3, the proposed FPGA deployment scheme reduces power consumption by 81.91% and increases FPS by 317.88% compared with the AMD R7-4800H CPU. Compared to the NVIDIA RTX 2060 GPU, power consumption is reduced by 91.41% and FPS is increased by 8.50%. The detection speed of our design is well above the 24 frames per second commonly taken as the threshold of human visual persistence, meeting the real-time requirements of object detection, and the power consumption is only 7.2 W, which satisfies the low-power requirements of airborne and spaceborne platforms.
The improved model deployed on the Xilinx KV260 after quantization has an mAP of 82.61%, 0.42% lower than the model before quantization, owing to the slight accuracy degradation caused by quantizing floating-point values. During Vitis AI quantization, precision loss occurs because the original floating-point parameters are converted into a fixed-point representation. Floating-point representation offers higher precision and can represent numbers with many decimal places, while fixed-point representation can only represent numbers with a fixed number of fractional bits or integers. During quantization, floating-point parameters are rounded or truncated into fixed-point parameters to accommodate the limited computational resources of hardware platforms. This causes information loss and rounding errors, as the fine details of the original model are constrained to fewer bits, and it can potentially impact the performance and accuracy of the model: as the parameters become coarser, the model may fail to capture subtle features and complex relationships present in the original model. As shown in Figure 12, despite the accuracy loss resulting from quantization, the overall impact on the target detection effect is minimal, which implies that quantization does not hinder the practical application of the model in real-world projects.
4.7. Comparison of Different Studies
Table 4 provides a comparison between our proposed approach and previous methods. It is noteworthy that our solution achieves the best results in terms of FPS and power consumption, which can be attributed to the integration of a series of lightweight techniques. By employing a lightweight feature extraction network combined with depthwise separable convolutions, and by utilizing two crucial model compression methods, pruning and quantization, our model ultimately exhibits exceptional detection speed. Additionally, when comparing the KV260 with the ZCU104, the KV260 demonstrates a power advantage.
5. Conclusions
This paper proposes an FPGA-based image processing solution for remote sensing images on aerospace platforms, applying a lightweight network and network pruning techniques to YOLOv4-MobileNetv3. The main improvements of this solution include the use of the lightweight feature extraction network MobileNetv3, which reduces the model parameters and improves the detection speed. To further reduce the model parameters and improve computational efficiency, depthwise separable convolutions are utilized, and channel pruning is applied to remove redundant parameters and computations. In response to the decrease in accuracy after pruning, we fine-tune the model to enhance its accuracy, even surpassing the accuracy before pruning. To adapt to the complex backgrounds of remote sensing images, we apply data augmentation techniques such as Mosaic and GridMask on the DIOR dataset to enhance the model's robustness and adaptability to various complex backgrounds. For the UltraScale+ MPSoC platform used, we employ the matching quantization tool, the Vitis AI quantizer, to perform post-training quantization (PTQ) on the pruned YOLOv4-MobileNetv3 network model, compile it with the Vitis AI compiler, and finally deploy it on the Xilinx KV260.
The comparative experimental results on the DIOR dataset demonstrate that our improved network model outperforms mainstream object detection models in terms of mAP and parameter count, with an mAP slightly lower than the original YOLOv4 model. To validate the superiority of our FPGA acceleration solution, we selected the mainstream deep learning hardware deployment platforms, a CPU and a GPU, for performance comparison. Experimental results show that our FPGA deployment solution has significant advantages in terms of power consumption and FPS. Compared with the AMD R7-4800H CPU, FPS improves by 317.88% and power consumption decreases by 81.91%; compared with the NVIDIA RTX 2060 (6 GB) GPU, power consumption decreases by 91.41% and FPS increases by 8.50%. In unmanned aerial and aerospace remote sensing image processing scenarios constrained by power consumption, weight, and volume, our design meets the requirements of low power consumption and high real-time performance.
The high-performance, low-power image processing hardware acceleration solution proposed in this paper offers a blueprint for deploying CNN models on satellite and unmanned aerial platforms. There are still areas for improvement in our experimental approach: this paper mainly focuses on the post-processing of remote sensing images and only provides a data augmentation scheme for pre-processing. In practical engineering, however, the original images captured by satellites usually suffer from noise, distortion, or uneven lighting, so pre-processing operations such as denoising, geometric correction, and radiometric calibration are required before object detection and recognition to improve image quality and accuracy. Furthermore, satellite images often have large scales and high resolutions, requiring operations such as cropping, scaling, or segmentation to adapt to specific application scenarios or analysis needs. In future research, we plan to focus on pre-processing and on optimizing the mAP metric for the downstream object detection task, as well as further improving the detection accuracy of the algorithm model through knowledge distillation.