Article

OrientedDiffDet: Diffusion Model for Oriented Object Detection in Aerial Images

School of Computer and Information Engineering, Tianjin Chengjian University, Tianjin 300384, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(5), 2000; https://doi.org/10.3390/app14052000
Submission received: 26 December 2023 / Revised: 10 February 2024 / Accepted: 14 February 2024 / Published: 28 February 2024

Abstract

Object detection is a fundamental task in remote-sensing image processing. Most existing detectors handle regression and classification by learning from a fixed set of learnable anchors or queries. To simplify object candidates, we propose a denoising diffusion process for remote-sensing object detection that detects objects directly from a set of random boxes. During training, horizontal detection boxes are first transformed into oriented detection boxes; the model then learns to reverse a diffusion process that corrupts the ground-truth oriented boxes into a random distribution. During inference, the model incrementally refines a set of randomly generated boxes to produce the final detections. Our proposed method achieves remarkable results: on the widely used DOTA dataset it reaches a mean average precision (mAP) of 76.59%, and on the HRSC2016 dataset it achieves 72.4% mAP.

1. Introduction

Object detection in remote-sensing images is a basic but challenging task in computer vision [1,2,3,4]. Most existing object detectors rely on prior knowledge and candidate boxes to regress detection boxes [5,6,7]; these candidate boxes simplify detection by providing initial bounding box proposals. In contrast, our method formulates object detection in remote-sensing images as a generative task: object boxes are detected directly from a set of random boxes by a diffusion model. In other words, predicting the position (center coordinates) and size (width and height) of a bounding box is treated as generation within the object detection framework.
Remote-sensing images exhibit diverse scales, dense distributions, and arbitrary orientations, so representing objects solely with horizontal boxes is ineffective [8,9,10]. In remote-sensing object detection, scholars have therefore increasingly focused on algorithms based on oriented boxes. For elongated objects in remote-sensing images, such as ports and ships [10,11,12], horizontal candidate regions introduce substantial background noise when the object's rotation angle is large, and this interference can degrade object classification. We therefore convert horizontal boxes into oriented boxes before applying the diffusion model. The oriented box representation employed in this paper uses a lightweight fully convolutional network: the center coordinates, width, and height of the horizontal box are combined with two edge-midpoint offsets, so the regression branch outputs six parameters instead of four. This significantly reduces the risk of overfitting compared with rotated Region Proposal Networks (RPNs), which typically have more parameters. Both horizontal and oriented boxes pose challenges during pooling for mask loss calculation in instance segmentation. To address this, we introduce a feature alignment algorithm that learns transformation parameters to convert from horizontal to oriented boxes. This module extracts rotation-invariant features from oriented boxes and mitigates the rounding effect that pooling imposes on horizontal RoIs (Regions of Interest).
Motivated by likelihood-based models that generate images by gradually removing noise via a learned denoising model, we propose a denoising diffusion process from noisy boxes to object boxes. To improve efficiency, we accelerate the diffusion process by reducing the number of denoising sampling steps; this acceleration significantly increases the sampling speed of both discrete-time and continuous-time diffusion probabilistic models.
Our first contribution is a feature alignment algorithm that encodes rotational invariance into the detector, yielding high-quality aerial object detection. Our second contribution is the introduction of DPM-Solver, which improves the performance and calibration of the diffusion model and enhances the overall effectiveness of our approach. We conducted extensive experiments on well-known remote-sensing datasets, including DOTA and HRSC2016; the results confirm the effectiveness of our approach. Detailed experimental findings, including accuracy metrics, performance comparisons, and calibration analyses, are presented in the paper.

2. Related Work

2.1. Object Detection

Remote-sensing image object detection often relies on prior anchors to represent oriented boxes [13,14,15]. Azimi et al. (2018) [16] proposed the unconstrained detection algorithm ICN (Image Cascade Network), which uses adaptive weight sharing to combine an image pyramid with the feature pyramid, enriching semantic information; however, it does not perform well on small objects. Xu et al. (2021) [17] introduced the Gliding Vertex algorithm, which slides the four vertices of a horizontal box to represent oriented boxes and significantly improves the precision of small object detection. Yang et al. (2019) [19] presented SCRDet (Small, Cluttered and Rotated object Detection) to address the angular boundary problem. Although these two-stage detectors achieve good performance, their inherently complex structures lead to a heavy computational burden. Yang et al. (2021) [18] proposed R3Det (Refined Rotation RetinaNet), a single-stage detector that refines the model to resolve the mismatch between predicted boxes and ground-truth boxes. Qian et al. (2021) [20] proposed RSDet (Rotation Sensitive Detector), which improved detection accuracy to 72.2% by addressing the sensitivity of the angular regression error. Improving the rotated-box representation to avoid the periodic angle variable, or reducing the sensitivity of angle regression, has thus become a main research direction for rotated-box detectors. Single-stage rotated-box detectors offer high detection efficiency and are widely applicable on edge devices, but they still fall short in accuracy. In this work, we propose a novel detector with an oriented box representation, named OrientedDiffDet. It generates rotated detection boxes directly from a set of random rotated boxes using a diffusion model, eliminating the need to obtain candidate anchors from prior knowledge.

2.2. Diffusion Models

Diffusion probabilistic models (DPMs) are a class of generative models based on Markov chains [21,22]. They can transform simple distributions into complex ones, making them suitable for a wide range of computer vision tasks, including semantic segmentation [23,24], text-to-image generation [25], and speech synthesis [26,27].
Amit et al. [28] proposed using diffusion models for the iterative refinement of segmentation maps: the probabilistic diffusion model is applied multiple times and the results are merged to obtain the final segmentation map. Baranchuk et al. [29] used intermediate network activations across Markov steps of the reverse diffusion process, enabling good semantic segmentation even with a small number of training images. Ramesh et al. [30] proposed using a diffusion model to invert the CLIP image encoder, allowing multiple images to be generated from a given image embedding. Chen et al. [27] proposed a conditional model for waveform generation that estimates gradients of the data density and trades off inference speed against sample quality by adjusting the number of refinement steps. Despite the significant interest in these ideas, there has been no successful diffusion-based solution for object detection in remote-sensing images, mainly because of the intrinsic challenges of oriented representation and computational inefficiency.
Existing fast samplers for DPMs fall into two categories. The first includes methods such as knowledge distillation and learning the noise level or sample trajectory. Salimans and Ho [31] proposed progressive distillation, which significantly reduces the sampling time of diffusion models in unconditional and class-conditional image generation; however, such methods often require a long training period before effective sampling and have limited applicability and flexibility. Lam et al. [32] introduced bilateral denoising diffusion models (BDDMs), which parameterize the forward and reverse processes with a score network and a scheduling network, respectively. This approach needs fewer steps to generate high-quality samples but may require effort to adapt to different models, datasets, and numbers of sampling steps.
The second category is training-free samplers, which aim to generate high-quality samples without extensive additional training. Although these methods offer quality comparable to ordinary samplers, they still consume considerable time, typically requiring around 50 function evaluations. DPM-Solver exploits the semi-linear structure of diffusion ODEs (ordinary differential equations) to approximate their exact solutions with the smallest possible error. In our work, we employ DPM-Solver to accelerate inference by applying it to the generation of oriented boxes.

3. Approach

3.1. Preliminaries

Diffusion models [22,33,34] recover data samples from randomly distributed initial samples through a stepwise denoising process. They are built on a Markovian forward diffusion chain that gradually increases the noise level of the data samples.
The forward noise process is described by the equation:
$$q(z_t \mid z_0) = \mathcal{N}\big(z_t \mid \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\mathbf{I}\big), \tag{1}$$
This equation transforms the data sample $z_0$ into a latent noisy sample $z_t$ at step $t$ by adding noise to $z_0$. Here, $\bar{\alpha}_t := \prod_{s=0}^{t} \alpha_s = \prod_{s=0}^{t} (1 - \beta_s)$, and $\beta_s$ denotes the noise variance schedule.
During training, a neural network $f_\theta(z_t, t)$ is trained to predict $z_0$ from $z_t$ by minimizing the $\ell_2$ loss [21]:
$$\mathcal{L}_{\text{train}} = \frac{1}{2}\, \big\| f_\theta(z_t, t) - z_0 \big\|^2. \tag{2}$$
In the inference phase, the data sample $z_0$ is reconstructed from the noise $z_T$ by applying the model $f_\theta$ iteratively, i.e., $z_T \rightarrow z_{T-\Delta} \rightarrow \cdots \rightarrow z_0$. In our setting, the data samples are a set of bounding boxes $z_0 = b$, where $b \in \mathbb{R}^{N \times 6}$ is a set of $N$ oriented boxes. A neural network $f_\theta(z_t, t, x)$ is trained to predict $z_0$ from the noisy oriented boxes $z_t$, conditioned on the corresponding image $x$, and to generate the corresponding category labels $c$.
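To make Equations (1) and (2) concrete, here is a minimal PyTorch sketch of one training step under a cosine noise schedule. All names (`alpha_bar`, `training_step`, `model`) are hypothetical, and the schedule offset s = 0.008 is a common default rather than a value taken from this paper.

```python
import torch

def alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal coefficient of a cosine noise schedule at step t."""
    f = lambda u: torch.cos((u / T + s) / (1 + s) * torch.pi / 2) ** 2
    return f(t) / f(torch.zeros_like(t))

def training_step(model, images, gt_boxes, T=1000):
    """gt_boxes: (B, N, 6) ground-truth oriented boxes (x, y, w, h, da, db)."""
    B = gt_boxes.shape[0]
    t = torch.randint(1, T + 1, (B,), device=gt_boxes.device).float()
    ab = alpha_bar(t, T).view(B, 1, 1)
    noise = torch.randn_like(gt_boxes)
    z_t = ab.sqrt() * gt_boxes + (1 - ab).sqrt() * noise  # forward corruption, Eq. (1)
    pred_boxes = model(z_t, t, images)                    # network predicts z_0 directly
    return 0.5 * ((pred_boxes - gt_boxes) ** 2).mean()    # l2 training loss, Eq. (2)
```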

3.2. Proposed Method

The architecture of our model consists of an image encoder and a detection decoder. The image encoder extracts high-level features from the input image x using a backbone network with an FPN (Feature Pyramid Network). The detection decoder takes a set of random proposal boxes as input, crops RoI features from the encoder's feature maps, and sends them to the detection head to refine the boxes predicted from the noisy ones. During training, we first construct the diffusion process from ground-truth boxes to noisy boxes and then train the model to reverse it. The noise scale is controlled by $\bar{\alpha}_t$ (Equation (1)), which follows a monotonically decreasing cosine schedule over time steps t. During inference, OrientedDiffDet performs denoising sampling to generate object boxes: it starts from boxes sampled from a Gaussian distribution and progressively refines its predictions. We use oriented boxes instead of horizontal boxes to accurately localize objects with arbitrary orientations, and we employ a feature alignment method to obtain rotation-invariant features for RoI classification and bounding box (bbox) regression.
Please refer to Figure 1 for an illustration of the architecture.

3.2.1. Oriented Box Representation

To account for the diverse orientations of objects in remote-sensing images, we propose an oriented box representation. Each oriented anchor box is described by six regression parameters $(x, y, w, h, \Delta\alpha, \Delta\beta)$: $(x, y)$ is the center of the outer (horizontal) rectangle of the oriented box, $w$ and $h$ are its width and height, and $\Delta\alpha$ and $\Delta\beta$ are the offsets of the midpoints of the top and right edges of the outer rectangle.
From these parameters, the four vertices of the oriented box are $(x + \Delta\alpha,\ y - h/2)$, $(x + w/2,\ y + \Delta\beta)$, $(x - \Delta\alpha,\ y + h/2)$, and $(x - w/2,\ y - \Delta\beta)$.
By regressing the offsets of the midpoints of two adjacent edges of the horizontal box, we can generate oriented candidate boxes from the regression of the original horizontal box. This representation captures object orientation more precisely while still building on horizontal-box regression (see the sketch below and Figure 2).
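As a worked example of this decoding, the following hypothetical NumPy helper maps the six parameters to the four vertices given above; with zero offsets it produces the diamond whose corners are the edge midpoints of the outer rectangle.

```python
import numpy as np

def oriented_box_vertices(x, y, w, h, da, db):
    """Decode (x, y, w, h, da, db) into the four vertices of the oriented box.

    (x, y): center of the outer horizontal rectangle; (w, h): its size;
    da, db: midpoint offsets on its top and right edges.
    """
    v1 = (x + da, y - h / 2)   # offset midpoint of the top edge
    v2 = (x + w / 2, y + db)   # offset midpoint of the right edge
    v3 = (x - da, y + h / 2)   # point symmetric to v1 on the bottom edge
    v4 = (x - w / 2, y - db)   # point symmetric to v2 on the left edge
    return np.array([v1, v2, v3, v4])

# Zero offsets: vertices (0,-1), (1,0), (0,1), (-1,0) -- the inscribed diamond.
print(oriented_box_vertices(0.0, 0.0, 2.0, 2.0, 0.0, 0.0))
```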

3.2.2. Feature Alignment

The detection decoder takes a set of oriented proposals as input. For each oriented proposal, feature alignment extracts a fixed-size feature vector from the corresponding feature map. This vector passes through two fully connected layers and then two sibling fully connected layers: the first sibling outputs the probability that the proposal belongs to each of K + 1 classes (K object classes plus one background class), and the second outputs the box offsets of the proposal for each of the K object classes.
Feature alignment is an operation that extracts rotation-invariant features from each oriented box. The oriented proposal created by the oriented RPN often takes the shape of a parallelogram (the blue box in Figure 3), defined by the vertex coordinates $v = (v_1, v_2, v_3, v_4)$. For ease of computation, we adjust each parallelogram to a rectangle with a direction by extending its shorter diagonal (the line from $v_2$ to $v_4$ in Figure 3) to the length of the longer diagonal. This simple operation yields the oriented rectangle $(x, y, w, h, \theta)$ (the red box in Figure 3), where $\theta \in [-\pi/2, \pi/2]$ is the angle between the horizontal axis and the longer side of the rectangle. A sketch of this adjustment follows.
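A minimal NumPy sketch of this adjustment, assuming vertices ordered v1..v4 so that v1–v3 and v2–v4 are the diagonals (the function name and vertex ordering are our assumptions):

```python
import numpy as np

def parallelogram_to_obb(v):
    """v: (4, 2) vertices of a parallelogram; returns (cx, cy, w, h, theta)."""
    v = np.asarray(v, dtype=float).copy()
    center = v.mean(axis=0)                      # diagonals bisect each other here
    d13, d24 = v[2] - v[0], v[3] - v[1]
    l13, l24 = np.linalg.norm(d13), np.linalg.norm(d24)
    if l13 < l24:                                # stretch the shorter diagonal ...
        v[0] = center - d13 / l13 * l24 / 2      # ... to the longer one's length,
        v[2] = center + d13 / l13 * l24 / 2      # which turns v into a rectangle
    else:
        v[1] = center - d24 / l24 * l13 / 2
        v[3] = center + d24 / l24 * l13 / 2
    e1, e2 = v[1] - v[0], v[2] - v[1]            # two adjacent sides
    w, h = np.linalg.norm(e1), np.linalg.norm(e2)
    longer = e1 if w >= h else e2
    theta = np.arctan2(longer[1], longer[0])     # angle of the longer side
    if theta > np.pi / 2: theta -= np.pi         # wrap into [-pi/2, pi/2]
    if theta < -np.pi / 2: theta += np.pi
    return (*center, max(w, h), min(w, h), theta)
```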
To perform feature alignment in remote-sensing object detection, we first project the oriented rectangle onto the feature map $\mathbf{F}$ with stride s, obtaining an oriented RoI. Each oriented RoI is then partitioned into an $m \times m$ grid (with m typically set to 7), yielding a fixed-size feature map $\mathbf{F}'$ of dimensions $m \times m \times C$. Within this feature map, the value of the grid cell indexed by $(i, j)$ in the c-th channel is computed as follows:
$$\mathbf{F}'_c(i, j) = \sum_{(x, y) \in \mathrm{area}(i, j)} \mathbf{F}_c\big(R(x, y, \theta)\big) / n \tag{3}$$
Here, $\mathbf{F}_c$ is the c-th channel of the feature map, n is the number of sample points localized within each grid cell, and $\mathrm{area}(i, j)$ is the set of coordinates included in the cell indexed by $(i, j)$. The function $R(\cdot)$ maps coordinates into the oriented RoI, so the sampling points undergo a rotation offset during bilinear interpolation. The offset values are computed as follows:
$$x' = \tan^{-1}\!\left(\frac{h/2 - \Delta\beta}{w/2 + \Delta\alpha}\right) + \mathrm{center}_w, \tag{4}$$
$$y' = \tan^{-1}\!\left(\frac{h/2 - \Delta\beta}{w/2 + \Delta\alpha}\right) + \mathrm{center}_h, \tag{5}$$
where $\mathrm{center}_w$ and $\mathrm{center}_h$ are the anchor center coordinates, and the rotation angle $\theta$ is expressed in terms of w, h, $\Delta\alpha$, and $\Delta\beta$.
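To illustrate the sampling, here is a rough PyTorch sketch of rotated RoI alignment with a single bilinear sample per bin. It is a simplified stand-in for the operation described above (the function name and arguments are our assumptions), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def rotated_roi_align(feat, roi, m=7, stride=1.0):
    """feat: (C, H, W) feature map; roi: (cx, cy, w, h, theta) in image coords."""
    C, H, W = feat.shape
    cx, cy, w, h = (v / stride for v in roi[:4])
    theta = torch.tensor(roi[4])
    ys, xs = torch.meshgrid(
        torch.linspace(-0.5, 0.5, m), torch.linspace(-0.5, 0.5, m), indexing="ij")
    cos, sin = torch.cos(theta), torch.sin(theta)
    # rotate the m x m sampling grid by theta around the RoI center
    sx = cx + xs * w * cos - ys * h * sin
    sy = cy + xs * w * sin + ys * h * cos
    # normalize to [-1, 1] as expected by grid_sample (bilinear interpolation)
    grid = torch.stack([sx / (W - 1) * 2 - 1, sy / (H - 1) * 2 - 1], dim=-1)
    return F.grid_sample(feat[None], grid[None], align_corners=True)[0]  # (C, m, m)
```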

3.2.3. Accelerated Diffusion

In each sampling step, random boxes or the boxes estimated in the previous step are sent to the detection decoder for category classification and box-coordinate prediction. After obtaining the boxes of the current step, accelerated diffusion, which leverages the semi-linear structure of the diffusion ordinary differential equations (ODEs), estimates the boxes for the next step. Given an initial value $x_T$ and $M + 1$ time steps $\{t_i\}_{i=0}^{M}$ decreasing from $t_0 = T$ to $t_M = 0$, we start with $\tilde{x}_{t_0} = x_T$ and iteratively compute the sequence $\{\tilde{x}_{t_i}\}_{i=1}^{M}$ as follows:
$$\tilde{x}_{t_i} = \frac{\alpha_{t_i}}{\alpha_{t_{i-1}}}\, \tilde{x}_{t_{i-1}} - \sigma_{t_i}\left(e^{h_i} - 1\right) \epsilon_\theta\big(\tilde{x}_{t_{i-1}}, t_{i-1}\big), \quad \text{where } h_i = \lambda_{t_i} - \lambda_{t_{i-1}}, \tag{6}$$
Accelerated diffusion approximates the exact solutions of diffusion ODEs by leveraging their semi-linearity through a simplified formulation involving an exponentially weighted integral of the noise prediction model. In our study, we apply the accelerated diffusion method to expedite the sampling process within our proposed OrientedDiffDet model for object detection in remote-sensing images. This implementation significantly enhances convergence speed and maximizes the utilization of known information embedded in the diffusion ODE.
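The following is a minimal sketch of this first-order update (Equation (6)). The schedule functions `alpha`, `sigma` and the noise-prediction network `eps_theta` are assumed callables, and stopping at a small final time instead of exactly 0 avoids a division by zero in the log-SNR.

```python
import math

def dpm_solver_1(eps_theta, x_T, alpha, sigma, T=1.0, M=10, t_end=1e-3):
    """First-order DPM-Solver: iterate Eq. (6) over M steps from t=T to t=t_end."""
    lam = lambda t: math.log(alpha(t) / sigma(t))         # half log-SNR lambda_t
    ts = [T + (t_end - T) * i / M for i in range(M + 1)]  # t_0 = T ... t_M = t_end
    x = x_T
    for i in range(1, M + 1):
        t_prev, t = ts[i - 1], ts[i]
        h = lam(t) - lam(t_prev)                          # h_i = lambda_{t_i} - lambda_{t_{i-1}}
        x = (alpha(t) / alpha(t_prev)) * x \
            - sigma(t) * (math.exp(h) - 1.0) * eps_theta(x, t_prev)
    return x
```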
Integrating accelerated diffusion into OrientedDiffDet yields faster and more efficient inference: sampling time is reduced, which makes the detector well suited for real-time remote-sensing applications. By incorporating this technique, we address both the oriented-representation and computational-efficiency challenges of object detection while achieving faster and more precise detection results on remote-sensing images.

4. Experiments

4.1. Datasets

To evaluate our proposed method, we conducted extensive experiments on two widely used object detection datasets: DOTA [35] and HRSC2016 [11].
DOTA. DOTA is currently the largest dataset for object detection in aerial images with bounding box annotations. It consists of 2806 large images across 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). The fully annotated dataset contains 188,282 instances, which vary in scale, orientation, and aspect ratio, and an evaluation server is provided. We used the training set for training and the test set for testing, with limited data augmentation. After image remapping, we cropped a series of 1024 × 1024 patches from each original image with a stride of 800, as sketched below.
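For illustration, the patch positions for such sliding-window cropping could be computed as in the following sketch (function name hypothetical):

```python
def crop_positions(size, patch=1024, stride=800):
    """Left/top coordinates of patch windows along one image axis."""
    xs = list(range(0, max(size - patch, 0) + 1, stride))
    if xs[-1] + patch < size:        # keep a final patch flush with the border
        xs.append(size - patch)
    return xs

# e.g. a 4000-pixel-wide DOTA image yields positions [0, 800, 1600, 2400, 2976]
print(crop_positions(4000))
```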
HRSC2016. HRSC2016 is a challenging dataset for ship detection in aerial images, collected from Google Earth. It comprises 1061 images and includes more than 20 categories of ships with various appearances. The image sizes range from 300 × 300 to 1500 × 900. The training, validation, and test sets consist of 436, 181, and 444 images, respectively. For data augmentation, we only employed horizontal flipping. The images were adjusted to a size of (512, 800), where 512 represents the length of the shorter edge, and 800 represents the maximum length of the image.

4.2. Implementation Details

We initialized the backbone network with weights pre-trained on ImageNet-1k [36] and trained with the AdamW optimizer, an initial learning rate of $2.5 \times 10^{-5}$, and a weight decay of $1 \times 10^{-4}$. All models were trained on 2 GPUs with a mini-batch size of 16. For DOTA, the default training schedule consisted of 450,000 iterations, with the learning rate divided by 10 at iterations 350,000 and 420,000; for HRSC2016, the schedule consisted of 360,000 iterations.
During the inference phase, the detection decoder iteratively refines predictions starting from Gaussian random boxes. We keep the top 100 predictions for DOTA and the top 300 for HRSC2016 by score, and the predictions of each sampling step are integrated using non-maximum suppression (NMS) to obtain the final results; the procedure is sketched below.
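Putting the pieces together, the inference loop could be sketched as follows. Here `detector`, `dpm_solver_step`, and `nms_rotated` are assumed callables (the last one standing in for a rotated-NMS routine from an external library), and the box-renewal threshold anticipates the discussion in Section 4.5; this is an illustrative pipeline, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def sample_boxes(detector, dpm_solver_step, nms_rotated, image, steps,
                 num_boxes=100, renew_thresh=0.3):
    """Iteratively refine Gaussian random boxes into final detections."""
    boxes = torch.randn(num_boxes, 6)                   # random oriented boxes z_T
    for t_prev, t in zip(steps[:-1], steps[1:]):        # decreasing time steps
        boxes, scores = detector(image, boxes, t_prev)  # predict z_0 and class scores
        keep = scores.max(dim=-1).values > renew_thresh
        # box renewal: replace low-scoring boxes with fresh Gaussian samples
        boxes[~keep] = torch.randn(int((~keep).sum()), 6)
        if t > 0:
            boxes = dpm_solver_step(boxes, t_prev, t)   # diffuse back toward step t
    return nms_rotated(boxes, scores)                   # integrate with rotated NMS
```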

4.3. Comparisons with the State-of-the-Art Methods

Results on DOTA. As shown in Table 1, we compared OrientedDiffDet with other state-of-the-art methods on the DOTA dataset. Our single-scale model achieves 76.59% mAP, outperforming the previous models. We also use DiffusionDet as a baseline: it uses horizontal boxes, while our model uses oriented boxes, which brings a substantial improvement over the baseline (67.16% vs. 76.59%). This suggests that oriented boxes capture object orientation in remote-sensing images more precisely. While our model delivers substantial gains in most categories, a slight decline is observed in a few classes compared with previous methods, including baseball diamond, ground track field, and soccer ball field. We attribute this to the inherent similarity of their shapes and the potential for confusing their contours with the background. Qualitative results of OrientedDiffDet on DOTA are shown in Figure 4.
In terms of frames per second (FPS), OrientedDiffDet achieves an inference speed of 11.3 FPS, competitive with the compared detectors, underscoring its ability to process remote-sensing images rapidly and efficiently. Notably, while OrientedDiffDet is only slightly faster than DiffusionDet (11.3 vs. 10.8 FPS), it maintains a superior level of accuracy. Our approach thus strikes an effective balance between accuracy and speed, making it a compelling choice for remote-sensing image-processing tasks.
These results follow from the design of OrientedDiffDet, which combines rotated boxes with a diffusion model: rotated boxes capture local geometry precisely, while the diffusion process aggregates global information, yielding high detection accuracy on remote-sensing targets. At the same time, the accelerated sampler keeps inference fast, so a high FPS is maintained even on large-scale imagery. Since remote-sensing tasks typically require a trade-off between speed and accuracy, this balance gives OrientedDiffDet practical value across a wide range of remote-sensing image-processing applications.
Results on HRSC2016. To assess the effectiveness and practicality of our method on the intricate challenges of remote-sensing object detection, particularly ship detection, we conducted a comprehensive evaluation on the HRSC2016 dataset; the results are summarized in Table 2. Beyond mAP (mean average precision), we report AP at multiple IoU thresholds (AP50 and AP75) for a more complete picture of performance. Our method achieves a leading 90.3% AP50, and the marked improvement at AP75 underscores its precision in object localization. These results confirm the robustness and practicality of our approach in complex scenarios such as ship detection and lay a foundation for further research; in future work we will continue to optimize the method for a broader range of remote-sensing detection settings.

4.4. Ablation Studies

We conducted ablation experiments on DOTA to study OrientedDiffDet in detail and verify the effectiveness of the oriented box representation, feature alignment, and accelerated diffusion. Unless otherwise specified, all experiments use ResNet-50 with FPN as the backbone and 2000 boxes for training and inference.
The results in Table 3 highlight the effectiveness of our model in accurate localization and recognition, with a significant overall increase in mAP (+5.76%). The oriented box representation and feature alignment individually improve mAP by 2.73 and 1.01 points, respectively, underscoring the value of the two proposed modules. Without accelerated diffusion, OrientedDiffDet infers more slowly than the original DiffusionDet (11.9 s vs. 10.8 s); the accelerated model reduces inference time by 0.6 s. This trade-off between speed and accuracy highlights the role of accelerated diffusion in improving the overall efficiency of the detector.
In summary, the three components play complementary roles. The oriented box representation captures object shape and pose more precisely, helping to distinguish similar objects; feature alignment extracts spatially invariant features that improve classification and box regression; and accelerated diffusion improves inference speed while maintaining detection accuracy. The ablation results show that all three components contribute positively: the first two drive the accuracy gains, and the third ensures efficiency, so the full model localizes and recognizes objects more accurately without sacrificing much inference speed.

4.5. Discussions

We also conducted experiments to examine the choice of parameters in our method and determine the optimal settings.
  • Signal scaling.
During the diffusion process, the signal scaling factor regulates the signal-to-noise ratio (SNR). As shown in Table 4, a scaling factor of 1.0 yields the best performance. Note that generating an oriented box involves six parameters rather than the four of a horizontal box; standardizing the training targets to a scale of 1.0 simplifies training and contributes to the superior performance of OrientedDiffDet. A sketch of this scaling is given below.
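As an illustration, DiffusionDet-style signal scaling could look like the following sketch (names hypothetical): box parameters normalized to [0, 1] are mapped to [−scale, scale] before noising and mapped back after denoising.

```python
import torch

scale = 1.0                                   # signal scaling factor (Table 4)

def to_signal(boxes01: torch.Tensor) -> torch.Tensor:
    """Map box parameters from [0, 1] to [-scale, scale] before adding noise."""
    return (boxes01 * 2.0 - 1.0) * scale

def from_signal(z0: torch.Tensor) -> torch.Tensor:
    """Map denoised predictions back to [0, 1], clamping out-of-range values."""
    return ((z0 / scale).clamp(-1.0, 1.0) + 1.0) / 2.0
```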
  • Box renewal threshold.
When prediction scores fall below a certain threshold, box renewal replaces those boxes to refresh the generative set. Table 5 shows the impact of different score thresholds: a threshold of 0.3 performs slightly better than the others, so we adopt 0.3 in our experiments.
  • Comparison using accelerated diffusion modules.
We compare different accelerated diffusion modules in Table 6, including our implementations of the DDPM [21] and DDIM [43] samplers. The accelerated diffusion sampler tracks the dynamic changes of the schedule coefficients during computation and thereby avoids repeated calculations, generating high-quality samples faster under the same computational budget; its advantage is most pronounced when sampling high-dimensional, complex distributions. There remains room for further optimization, for example in balancing the diffusion velocity across directions to improve sampling efficiency, or in combining the accelerated sampler with other optimization algorithms; we will explore these directions in future work.

5. Conclusions

Our work introduces a novel approach to oriented object detection in remote-sensing images, combining oriented box representation, feature alignment, and accelerated diffusion. Extensive experiments conducted on the DOTA and HRSC2016 datasets have demonstrated the exceptional performance of our model in terms of both accuracy and speed. Future research directions may involve the further refinement and enhancement of our proposed models and methodologies. Some potential areas for exploration include tackling challenges associated with complex backgrounds, developing effective occlusion handling techniques, and enhancing scalability to accommodate larger datasets. Additionally, exploring the integration of multi-modal data sources, such as LiDAR or hyperspectral imagery, could open up new avenues for improving object detection performance in remote-sensing applications. These directions represent promising avenues for advancing the field and addressing the evolving demands of remote-sensing image analysis.

Author Contributions

Conceptualization, L.W. and H.D.; methodology, L.W.; validation, J.J. and H.D.; formal analysis, J.J.; investigation, J.J. and H.D.; data curation, J.J.; writing—original draft preparation, L.W.; writing—review and editing, L.W. and H.D.; supervision, L.W.; project administration, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No.62204168).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

  1. Cheng, G.; Zhou, P.; Han, J. RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 2884–2893. [Google Scholar] [CrossRef]
  2. Fu, K.; Chen, Z.; Zhang, Y.; Sun, X. Enhanced Feature Representation in Detection for Optical Remote Sensing Images. Remote Sens. 2019, 11, 2095. [Google Scholar] [CrossRef]
  3. Wang, G.; Wang, X.; Fan, B.; Pan, C. Feature Extraction by Rotation-Invariant Matrix Representation for Object Detection in Aerial Image. IEEE Geosci. Remote Sens. Lett. 2017, 14, 851–855. [Google Scholar] [CrossRef]
  4. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  5. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  7. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  8. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  9. Qiu, H.; Li, H.; Wu, Q.; Meng, F.; Ngan, K.N.; Shi, H. A2RMNet: Adaptively Aspect Ratio Multi-Scale Network for Object Detection in Remote Sensing Images. Remote. Sens. 2019, 11, 1594. [Google Scholar] [CrossRef]
  10. Li, Y.; Huang, Q.; Pei, X.; Jiao, L.; Shang, R. RADet: Refine Feature Pyramid Network and Multi-Layer Attention Network for Arbitrary-Oriented Object Detection of Remote Sensing Images. Remote. Sens. 2020, 12, 389. [Google Scholar] [CrossRef]
  11. Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1074–1078. [Google Scholar] [CrossRef]
  12. Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  13. Liao, W.; Chen, X.; Yang, J.; Roth, S.; Goesele, M.; Yang, M.Y.; Rosenhahn, B. LR-CNN: Local-aware Region CNN for Vehicle Detection in Aerial Imagery. arXiv 2020, arXiv:2005.14264. [Google Scholar] [CrossRef]
  14. He, X.; Ma, S.; He, L.; Ru, L.; Wang, C. Multi-Sector Oriented Object Detector for Accurate Localization in Optical Remote Sensing Images. Remote. Sens. 2021, 13, 1921. [Google Scholar] [CrossRef]
  15. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Trans. Multim. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
  16. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery. In Asian Conference on Computer Vision; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 150–165. [Google Scholar] [CrossRef]
  17. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 2–9 February 2021; pp. 3163–3171. [Google Scholar]
  19. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8231–8240. [Google Scholar] [CrossRef]
  20. Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning Modulated Loss for Rotated Object Detection. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 2–9 February 2021; pp. 2458–2466. [Google Scholar]
  21. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; 2020. [Google Scholar]
  22. Song, Y.; Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R., Eds.; 2019; pp. 11895–11907. [Google Scholar]
  23. Kim, B.; Oh, Y.; Ye, J.C. Diffusion Adversarial Representation Learning for Self-supervised Vessel Segmentation. arXiv 2022, arXiv:2209.14566. [Google Scholar] [CrossRef]
  24. Wolleb, J.; Sandkühler, R.; Bieder, F.; Valmaggia, P.; Cattin, P.C. Diffusion Models for Implicit Image Segmentation Ensembles. In Proceedings of the International Conference on Medical Imaging with Deep Learning, MIDL 2022, Zurich, Switzerland, 6–8 July 2022; Proceedings of Machine Learning Research. Konukoglu, E., Menze, B.H., Venkataraman, A., Baumgartner, C.F., Dou, Q., Albarqouni, S., Eds.; 2022; Volume 172, pp. 1336–1348. [Google Scholar]
  25. Chen, N.; Zhang, Y.; Zen, H.; Weiss, R.J.; Norouzi, M.; Chan, W. WaveGrad: Estimating Gradients for Waveform Generation. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
  26. Ho, J.; Salimans, T.; Gritsenko, A.A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video Diffusion Models. arXiv 2022, arXiv:2204.03458. [Google Scholar] [CrossRef]
  27. Chen, N.; Zhang, Y.; Zen, H.; Weiss, R.J.; Norouzi, M.; Dehak, N.; Chan, W. WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. In Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August–3 September 2021; Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P., Eds.; 2021; pp. 3765–3769. [Google Scholar] [CrossRef]
  28. Amit, T.; Nachmani, E.; Shaharabany, T.; Wolf, L. SegDiff: Image Segmentation with Diffusion Probabilistic Models. arXiv 2021, arXiv:2112.00390. [Google Scholar]
  29. Baranchuk, D.; Rubachev, I.; Voynov, A.; Khrulkov, V.; Babenko, A. Label-Efficient Semantic Segmentation with Diffusion Models. arXiv 2021, arXiv:2112.03126. [Google Scholar]
  30. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  31. Salimans, T.; Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
  32. Lam, M.W.Y.; Wang, J.; Huang, R.; Su, D.; Yu, D. Bilateral Denoising Diffusion Models. arXiv 2021, arXiv:2108.11514. [Google Scholar]
  33. Zhang, Q.; Tao, M.; Chen, Y. gDDIM: Generalized denoising diffusion implicit models. arXiv 2022, arXiv:2206.05564. [Google Scholar] [CrossRef]
  34. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015; JMLR Workshop and Conference, Proceedings. Bach, F.R., Blei, D.M., Eds.; 2015; Volume 37, pp. 2256–2265. [Google Scholar]
  35. Xia, G.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.J.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  36. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  37. Ding, J.; Xue, N.; Long, Y.; Xia, G.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 2849–2858. [Google Scholar] [CrossRef]
  38. Prete, R.D.; Graziano, M.D.; Renga, A. RetinaNet: A deep learning architecture to achieve a robust wake detector in SAR images. In Proceedings of the 6th IEEE International Forum on Research and Technology for Society and Industry, RTSI 2021, Naples, Italy, 6–9 September 2021; pp. 171–176. [Google Scholar] [CrossRef]
  39. Chen, S.; Sun, P.; Song, Y.; Luo, P. DiffusionDet: Diffusion Model for Object Detection. arXiv 2022, arXiv:2211.09788. [Google Scholar] [CrossRef]
  40. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 3500–3509. [Google Scholar] [CrossRef]
  41. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355. [Google Scholar] [CrossRef]
  42. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point Set Representation for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9656–9665. [Google Scholar] [CrossRef]
  43. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. OpenReview.net. [Google Scholar]
Figure 1. Overall framework of OrientedDiffDet. The image encoder extracts a feature representation from the input image. The detector is two-stage and built on an FPN backbone: the first stage generates oriented proposals via the diffusion model, and the second stage takes the aligned features as input to predict classification and regression.
Figure 2. An example of the oriented (midpoint offset) box representation.
Figure 3. Illustration of oriented feature alignment. The blue box is a proposal generated by the diffusion model, and the red box is its corresponding rectangular proposal used for projection and feature alignment.
Figure 4. Visualization of OrientedDiffDet results on DOTA.
Table 1. Comparisons with state-of-the-art methods on the DOTA dataset.
Category | RoI_Trans [37] | R3Det [18] | RRetinaNet [38] | DiffusionDet [39] | S2ANet [8] | Ours
PL | 88.64 | 88.76 | 88.67 | 88.20 | 89.11 | 88.84
BD | 78.52 | 83.09 | 77.62 | 77.03 | 82.84 | 82.69
BR | 43.44 | 50.91 | 41.81 | 41.10 | 48.37 | 53.94
GTF | 75.92 | 67.27 | 58.17 | 52.09 | 71.11 | 74.98
SV | 68.81 | 76.23 | 74.58 | 71.25 | 78.11 | 78.95
LV | 73.68 | 80.39 | 71.64 | 70.08 | 78.39 | 84.89
SH | 83.59 | 86.72 | 79.11 | 76.55 | 87.25 | 88.86
TC | 90.74 | 90.78 | 90.29 | 88.63 | 90.83 | 90.93
BC | 77.27 | 84.68 | 82.18 | 80.16 | 84.90 | 87.84
ST | 81.46 | 83.24 | 74.32 | 74.43 | 85.64 | 85.75
SBF | 58.39 | 61.98 | 54.75 | 53.96 | 60.36 | 61.67
RA | 53.54 | 61.35 | 60.60 | 57.37 | 62.60 | 60.45
HA | 62.83 | 66.91 | 62.57 | 60.23 | 65.26 | 75.79
SP | 58.93 | 70.63 | 69.67 | 61.47 | 69.13 | 68.65
HC | 47.67 | 53.94 | 60.64 | 54.92 | 57.94 | 64.67
mAP | 69.56 | 73.79 | 68.43 | 67.16 | 74.12 | 76.59
FPS | 11.3 | 14.7 | 16.1 | 10.8 | 15.1 | 11.3
Table 2. Performance comparison on the HRSC2016 dataset.
Method | AP | AP50 | AP75
DiffusionDet [39] | 0.506 | 0.761 | 0.606
RoI_Trans [37] | 0.614 | 0.892 | 0.758
ORCNN [40] | 0.547 | 0.896 | 0.637
FCOS [41] | 0.529 | 0.883 | 0.620
RepPoints [42] | 0.518 | 0.837 | 0.591
Ours | 0.724 | 0.903 | 0.895
Table 3. Ablation experiment on the DOTA validation set.
Oriented Box Representation | Feature Alignment | Accelerated Diffusion | mAP | FPS
 | | | 73.20 | 10.8
✓ * | | | 75.93 | 11.7
 | ✓ | | 74.21 | 11.9
✓ | ✓ | ✓ | 78.96 | 11.3
* A checkmark indicates that the module is enabled.
Table 4. Signal scaling factor. A factor of 1.0 performs best.
Scale | 0.1 | 1.0 | 2.0 | 3.0
mAP | 77.88 | 78.96 | 78.21 | 78.21
Table 5. Box renewal score threshold. A threshold of 0.3 works best.
Score Thresh. | 0.0 | 0.3 | 0.5 | 0.7
mAP | 77.04 | 78.96 | 77.37 | 77.3
Table 6. Comparison of accelerated diffusion modules.
Sampler | DDPM [21] | DDIM [43] | Accelerated Diffusion (Ours)
FPS | 10.8 | 11.20 | 11.30