1. Introduction
Remote sensing technology, a vital tool for observing and understanding our environment, has evolved considerably over the years. Advances in artificial intelligence have driven much of this change, and deep learning algorithms have emerged as effective tools for analyzing remote sensing images and extracting valuable information. Rapid progress in deep learning has enabled a variety of object detection techniques that have recently been applied in remote sensing applications such as defense [1], agriculture [2], resource exploration [3], surveillance [4], disaster management [5], and environmental monitoring [6].
Remote sensing object detection methods fall into two categories: two-stage and one-stage. Faster R-CNN [7] and R-FCN [8] are two-stage approaches that use a region proposal network (RPN) to identify regions of interest in an image, followed by classification and regression steps. Despite their high precision, these approaches are inefficient because of the substantial computation they require. One-stage approaches, on the other hand, such as SSD [9] and YOLO [10], improve inference speed by producing the final position and classification predictions simultaneously, albeit with a minor loss of precision. Regardless of their building-block structure, most conventional object detection methods use horizontal bounding box proposals for the detected objects. However, due to the complexity of remote sensing images, such as their top-down perspective and high resolution, typical horizontal object detection algorithms struggle to achieve good results.
Researchers have proposed object detection approaches based on rotated bounding box proposals to improve performance in complex remote sensing scenarios. Existing rotated object detection algorithms use predefined anchor boxes with varying scales, aspect ratios, and angles to identify arbitrarily oriented, multiscale objects. To reduce computation, some algorithms, such as RoI-Transformer [11], learn the transformation from horizontal to rotated RoIs. Similarly, Oriented R-CNN [12] uses rotation-capable bounding boxes to adapt to changing object orientations in remote sensing imagery. Other approaches, such as CAD-Net [13] and Info-FPN [14], build robust feature representations using contextual information or multiscale representations. Further approaches have been used to improve object detection in remote sensing images, such as refining sampling to improve small-object detection and building feature refinement modules for the correct alignment of rotated anchor boxes. Nonetheless, despite these promising improvements, a number of problems remain that limit the use of deep learning-based object detection systems in remote sensing applications.
One of the most common problems in object detection is overfitting, which occurs when models perform well on training data but poorly on unseen data. The lack of uncertainty quantification in object detection, which is critical for risk-sensitive applications, exacerbates the problem. Furthermore, standard feature extraction algorithms frequently fail to adequately capture and extract multiscale features, resulting in information loss and unsatisfactory model performance. In addition, the models’ effectiveness is hampered by a lack of high-resolution images, which are required for accurate object detection. Finally, retrieving background information for detected objects is a critical challenge in refining these models for diverse remote sensing applications.
The abovementioned issues are discussed in detail below:
1. Overfitting and lack of uncertainty quantification
Overfitting happens when a model learns the precise characteristics and noise in the training dataset to such an extent that its performance on unseen, real-world data is significantly impaired. As a result, the model performs well on training data while struggling to sustain that level of performance on new, previously unseen data. This has severe ramifications in remote sensing because these models are frequently employed to detect crucial environmental and geographical features. An overfit model may miss crucial characteristics or mistakenly identify irrelevant ones, resulting in incorrect interpretations and conclusions. Furthermore, the absence of uncertainty quantification means that the model’s predictions lack a measure of confidence or error estimation. In other words, it is impossible to determine how reliable the model’s prediction is for unseen data, which is especially problematic for risk-sensitive applications where knowing the margin of error is critical. As a result, strategies must be investigated that can not only prevent overfitting but also enable uncertainty estimation for detected objects;
2. Information loss in feature extraction
A major concern in the field of remote sensing object detection is the issue of information loss during feature extraction, which serves as the basis for effective object detection. Traditional feature extraction models, which are frequently used as the backbone networks for object detection systems, fail to acquire and preserve all of the important information from remote sensing images. This information loss can have a severe influence on object detection performance since key cues for identifying objects may be missed during feature extraction. Moreover, the feature pyramid network (FPN), a popular component of these systems, faces comparable difficulties. The FPN’s inner lateral connections, which are intended to integrate high-level semantic information with low-level specific information, also suffer from information loss. This can limit the network’s ability to handle objects of varying sizes.
Super-resolution techniques can be used to address this issue. These techniques aim to generate a higher-resolution image from low-resolution images, effectively enhancing the level of detail and boosting object detection accuracy. Furthermore, the use of these techniques allows for the investigation of the effect of higher resolution on the detection accuracy of object detection models. Such a study can provide useful insights into the relationship between image resolution and detection performance, guiding future advancements, particularly for objects with minute or delicate characteristics. As a result, it is critical to propose a new backbone network and modify the FPN’s inner lateral connection module;
3. Remote sensing image resolution
High-resolution images are critical for accurate object detection and interpretation in remote sensing images. However, one of the most common problems is the lack of high-resolution images. Remote sensing platforms may not always be able to record and send high-resolution images due to different constraints such as sensor limitations, transmission bandwidth, or storage capacity. The use of lower-resolution images in object detection might result in the loss of fine features, lowering the performance of detection algorithms and resulting in inaccurate or missed detections.
4. Background interpretation
The implementation of remote sensing applications relies significantly not only on the detected objects but also on the context or background information surrounding them. However, existing object detection models frequently neglect the substantial information contained in the background and only prioritize the objects of interest. This lack of context can restrict the interpretability and applicability of detection results in a variety of scenarios. For instance, understanding the environmental background (urban, forest, water bodies, etc.) might have a big impact on how the identified objects (vehicles, buildings, plants, etc.) in the images are interpreted. In addition, certain applications, such as disaster management and land use planning, require comprehensive spatial data that includes both the objects of interest and their surroundings. To circumvent this limitation, background classification must be incorporated into the object detection model. This strategy would enable a more comprehensive comprehension of the detected scene by providing additional context, thereby enhancing the overall effectiveness and adaptability of remote sensing applications.
In this paper, we propose a Bayes by backpropagation (BBB)-based system for extracting detection and classification information from remote sensing images. First, we present the Bayes R-CNN object detection model, which uses BBB convolution layers to prevent overfitting and enables our model to compute epistemic and aleatoric uncertainty for each detected object. A novel backbone model called the multi-resolution extraction network (MRENet) is presented to effectively extract features from remote sensing images. In addition, the traditional FPN is modified by incorporating the multi-level feature fusion module (MLFFM) in the inner lateral connection part to improve feature extraction and by introducing a novel local–global attention module called the Bayesian distributed lightweight attention module (BDLAM) to extract both local and global features from feature maps. Before the prediction phase, the system also incorporates a Bayesian super-image-resolution technique to improve image quality and system performance. In the final stage, a background information extractor based on MRENet is added to provide background information for the detected images. The abbreviations of the methods used in this study are given in Table 1.
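To make the BBB idea concrete, the sketch below shows a minimal Bayes-by-Backprop convolution layer in PyTorch, following the standard reparameterization formulation. It is an illustrative approximation rather than the exact layer used in Bayes R-CNN, and all names and initialization values (BayesConv2d, the prior settings) are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesConv2d(nn.Module):
    """Minimal Bayes-by-Backprop convolution: each weight is a Gaussian whose
    mean and (softplus-transformed) standard deviation are learned, and a new
    weight sample is drawn on every forward pass."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        shape = (out_ch, in_ch, kernel_size, kernel_size)
        self.weight_mu = nn.Parameter(torch.empty(shape).normal_(0, 0.1))
        self.weight_rho = nn.Parameter(torch.full(shape, -5.0))  # sigma = softplus(rho), starts small
        self.stride, self.padding = stride, padding

    def forward(self, x):
        sigma = F.softplus(self.weight_rho)
        eps = torch.randn_like(sigma)
        weight = self.weight_mu + sigma * eps  # reparameterization trick
        return F.conv2d(x, weight, stride=self.stride, padding=self.padding)

# Usage: a drop-in replacement for nn.Conv2d; in full BBB training, a KL(q || p)
# term between the weight posterior and a prior would be added to the loss.
layer = BayesConv2d(3, 16, kernel_size=3, padding=1)
out = layer(torch.randn(1, 3, 64, 64))  # -> (1, 16, 64, 64)
```

Because the weights are re-sampled at every pass, repeated inference on the same image yields a distribution of predictions, which is what later enables the uncertainty estimates discussed in Section 6.5.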
This paper’s primary contributions are outlined below:
- This paper introduces the Bayes R-CNN object detection framework, designed to enhance the reliability of object detection in remote sensing imagery by mitigating overfitting and providing uncertainty estimates;
- A novel backbone network called MRENet is proposed to replace traditional backbones, and the lateral connection of the FPN is replaced with MLFFM to preserve relevant features;
- BDLAM is introduced into the FPN to improve feature extraction and reduce positional information loss during feature extraction;
- Bayesian super-image resolution is embedded in the prediction step to generate high-resolution images from low-resolution inputs prior to object detection, improving detection performance;
- MRENet is then applied to the predicted images to classify the background of the detected objects, providing a more robust interpretation of the remote sensing scene.
6. Experimental Results and Analysis
This section presents the experimental methods and metrics used to demonstrate the robustness of the proposed Bayesian object detection model. We first describe the experimental platform and relevant parameters needed to reproduce the results reported here, followed by the evaluation metrics commonly used for detection. The training and testing analysis of the proposed model is then presented. Next, the predictions are evaluated through uncertainty estimation to demonstrate the uncertainty quantification capability of the proposed model. Lastly, the object detection results and background retrieval information are presented to demonstrate the future potential of Bayesian object detection.
6.1. Experimental Platform and Parameters
The training and testing analysis was performed on the Ubuntu operating system with an AMD Ryzen 5 3500X processor and an NVIDIA RTX 3080 GPU. The relevant code was run in Python 3.7 with the PyTorch deep learning library.
6.2. Evaluation Metrics
In this study, metrics such as precision, recall, and mean average precision (mAP) were applied to the detection tasks, while accuracy, recall, precision, and F1-score were used for background classification. Validation and testing involve analyzing true positives (TP), false positives (FP), and false negatives (FN). The relevant equations can be seen below:
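These are the standard definitions of the metrics, which we assume are the ones intended here:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
\[
AP = \int_{0}^{1} p(r)\, dr, \qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,
\]

where \(p(r)\) is the precision–recall curve for a class and \(N\) is the number of object classes.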
6.3. Bayes R-CNN Result Analysis
The proposed object detection model was trained using different backbones such as ResNet50, ResNext50, Wide ResNet50, ShuffleNet V2, MobileNet V3-L, ResNet101, RegNet, and MRENet. We then calculated the mAP to evaluate the performance of the proposed model. The same hyperparameters were used for all backbones throughout the training process: the learning rate was set to 0.01, and the gamma (γ) of the learning rate scheduler was set to 0.2.
Table 2 shows the results of the proposed model for all backbones. It can be seen from Table 2 that the proposed model with ShuffleNet V2 2× performed worse than the other backbones, whereas the proposed model with MRENet outperforms the other models across all evaluation metrics. Therefore, the MRENet backbone was used for hyperparameter optimization to obtain the best result.
Table 3 compares the mAP values obtained with our proposed Bayes R-CNN architecture when employing different backbone networks. The Bayes R-CNN obtains an mAP of 86.87 with the ResNet50 backbone. With the ResNext50 backbone, the mAP increases marginally to 87.79, and with the Wide ResNet50 backbone it rises to 88.26. When the ShuffleNet V2 and MobileNet V3-L backbones are employed, there is a substantial drop in performance, with mAP scores of 80.94 and 83.65, respectively. The adoption of the ResNet101 and RegNet backbones brings notable gains, with mAP values of 87.51 and 89.08, respectively. Finally, the Bayes R-CNN framework outperforms all other configurations with the highest mAP of 91.23 when using the MRENet backbone.
The learning rate plays an important role in achieving the best result in the object detection system. The proposed model was trained with different gamma values for the learning rate scheduler while keeping the learning rate fixed: gamma values of 0.3, 0.2, and 0.1 were used with a learning rate of 0.01.
Table 4 shows the effect of these hyperparameters on the proposed model’s performance. It can be seen from Table 4 that the proposed model achieved its best results with a gamma of 0.1 on the DIOR dataset (mAP of 74.63) and a gamma of 0.2 on the HRSC2016 dataset (mAP of 91.23).
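As a rough illustration of this setup, the following PyTorch sketch shows how a step learning rate scheduler with the reported learning rate and gamma could be configured. The step size and the placeholder model are assumptions, since neither is stated in the text.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder module standing in for Bayes R-CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr = 0.01, as reported
# gamma in {0.3, 0.2, 0.1} was swept in the paper; step_size is assumed here.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.2)

for epoch in range(30):
    # ... per-batch forward/backward passes and optimizer.step() would run here ...
    scheduler.step()  # decays the learning rate: lr <- lr * gamma every `step_size` epochs
```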
Table 5 presents an ablation study for the proposed object detection model, evaluating various configurations of the MRENet, BDLAM, and MLFFM submodules. This study assesses the performance of each combination in terms of mAP on the DIOR and HRSC2016 datasets.
MRENet achieved an mAP of 72.67% when evaluated separately on the DIOR dataset, MLFFM had an mAP of 71.21%, and BDLAM had an mAP of 72.08%. Meanwhile, MRENet, MLFFM, and BDLAM obtained mAPs of 90.16%, 90.02%, and 90.37%, respectively, on the HRSC2016 dataset.
The mAP was increased to 73.26% on DIOR and 90.91% on HRSC2016 when MRENet and MLFFM were added together. On DIOR and HRSC2016, combining MRENet and BDLAM yielded mAPs of 73.58% and 91.04%, respectively, whereas combining MLFFM and BDLAM produced mAPs of 73.05% and 90.78%.
The best results were obtained when all three modules (MRENet, MLFFM, and BDLAM) were used simultaneously, resulting in mAPs of 74.63% on DIOR and 91.23% on HRSC2016. These findings imply that combining the three modules improves the overall performance of the Bayes R-CNN.
Figure 11 shows a visual comparison of the feature map visualizations derived from three different methods: the traditional FPN, MLFFM, and the combined MLFFM + BDLAM approach. In the feature map produced using the traditional FPN, there are noticeable inaccuracies in feature extraction. This is manifested through both false negatives and missing region-of-interest features, leading to an imprecise representation of the essential features within the image. Transitioning to the MLFFM method, there is a visible improvement in the extraction process. The feature map under this method is more accurate, capturing the relevant features with greater precision. However, it is crucial to note that while missing region-of-interest features are mitigated, there are minor occurrences of false negatives, indicating slight omissions in the feature extraction process. In contrast, the feature map generated with the integration of MLFFM and BDLAM presents a significant enhancement in accuracy. The MLFFM + BDLAM method not only accurately extracts features but also meticulously emphasizes the most crucial and relevant ones. The BDLAM component plays a pivotal role in this refinement, directing the model’s attention effectively to focus on and highlight the features of paramount importance.
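Feature maps such as those in Figure 11 are commonly obtained by attaching forward hooks to intermediate layers of the network. The minimal PyTorch sketch below illustrates this generic procedure; the layer name and the channel-averaging step are our own assumptions, not the exact visualization method used in this study.

```python
import torch

def capture_feature_maps(model, layer_name, image):
    """Register a forward hook on a named layer and return its activation map
    for visualization (here averaged across channels to form a heatmap)."""
    feats = {}
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o.detach()))
    with torch.no_grad():
        model(image)
    handle.remove()
    return feats["out"].mean(dim=1)  # (N, H, W) channel-averaged activation map

# Hypothetical usage: heat = capture_feature_maps(detector.backbone, "layer3", img_tensor)
```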
Figure 12 shows the visualization of the proposed Bayes R-CNN model’s object detection performance on the DIOR dataset under different background scenarios. The presented results clearly demonstrate the model’s resilience to the varied and complicated backgrounds frequently encountered in real-world circumstances. The Bayes R-CNN demonstrates improved object detection and localization capabilities, accurately separating regions of interest from complex backgrounds. The model’s ability to retain high accuracy in a variety of environments demonstrates its versatility and robustness, both of which are critical for object detection on a challenging benchmark like DIOR. As a result, this figure emphasizes the Bayes R-CNN model’s ability to handle complex object detection scenarios.
Figure 13 shows a visual illustration of the capabilities of the Bayes R-CNN model applied to the HRSC2016 dataset under various background scenarios. These findings highlight the model’s ability to handle a wide range of backgrounds, which is crucial for real-world object detection tasks. Objects are efficiently detected and located using the Bayes R-CNN model, even in potentially complicated backgrounds. The model’s consistent performance across various background scenarios demonstrates its adaptability and resilience, both of which are required for efficient object detection on the HRSC2016 dataset.
Figure 14 presents a visual exploration of the performance metrics across various object detection models, comparing their parameter sizes with their mAP percentages. The scatter plot reveals a diverse spectrum of parameter sizes, from a compact 31 M in models like LSKNet-S to a larger 55.1 M in models such as RoI-T. Notably, many models remain around the 90% mAP mark, suggesting that performance is relatively stable across different model complexities. The Bayes R-CNN, with its 54.7 M parameters, stands out by achieving a 91.2% mAP. This is especially remarkable when compared to models like RoI-T, which, despite its slightly larger parameter size, only manages an mAP of 86.2%. These data underscore the Bayes R-CNN’s balance of efficiency and performance, making it a compelling choice for object detection tasks.
6.4. Image Super-Resolution on Remote Sensing Image Result Analysis
The image super-resolution technique can play a significant role in improving the quality of low-resolution remote sensing images. Therefore, this study proposed a Bayesian LESRCNN to improve the quality of the DIOR dataset. The model was trained for 3000 epochs with a batch size of 64 and a learning rate of 0.01. The proposed model achieved a mean PSNR of 32.56 dB and a mean SSIM of 0.9236 on the Urban100 dataset, surpassing the accuracy of the baseline model for 2× scaling, as can be seen in Table 6.
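For reference, PSNR and SSIM are conventionally defined as follows (the standard formulations, assumed to be the ones used here):

\[
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{MAX_I^2}{\mathrm{MSE}}\right), \qquad
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},
\]

where \(MAX_I\) is the maximum pixel value, MSE is the mean squared error between the super-resolved and reference images, \(\mu\) and \(\sigma\) denote local means, variances, and covariance, and \(c_1\), \(c_2\) are small stabilizing constants.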
The Bayes R-CNN was further evaluated to compare the effect of integrating the image super-resolution technique prior to the prediction step.
Figure 15 shows the detection results for three intricate scenarios, with outcomes depicted both before and after applying the image super-resolution technique.
The inference from Figure 15 indicates a noticeable impact of the super-resolution technique on our model’s performance. Without the super-resolution technique, there is a slight decrease in prediction accuracy, which can be attributed to the lower-resolution images lacking the granularity needed for the model to effectively distinguish and detect objects.
Conversely, following the introduction of the image super-resolution technique, there is a significant increase in prediction accuracy. The super-resolution technique enhances the resolution of input images, thereby providing more detailed information. The richer detail within these images improves the model’s ability to distinguish between different objects, consequently reducing false positives and negatives.
Specifically, the first image in Figure 15b shows an increased number of correctly detected objects after applying the super-resolution technique, indicating a higher recall rate. The second image exhibits a reduction in the false-positive rate following the application of the super-resolution technique, demonstrating a marked improvement in precision. The third scenario illustrates the impact of superior image quality on object localization: the enhanced resolution allows for more precise positioning of bounding boxes around detected objects, thereby improving the precision of detection.
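The inference order described above, super-resolving the input and then running detection, can be sketched as follows. The function and model names (sr_model, detector) are placeholders rather than the actual Bayesian LESRCNN and Bayes R-CNN interfaces.

```python
import torch

def detect_with_super_resolution(lr_image, sr_model, detector):
    """Hypothetical pipeline: upscale a low-resolution image tensor (C, H, W)
    with a super-resolution network, then run the object detector on the result."""
    sr_model.eval()
    detector.eval()
    with torch.no_grad():
        hr_image = sr_model(lr_image.unsqueeze(0))  # e.g., 2x upscaling
        detections = detector(hr_image)             # boxes, labels, scores
    return detections
```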
6.5. Uncertainty-Aware Object Detection Result Analysis
Uncertainty estimation plays a vital role in evaluating a model’s predictions under open-set conditions. For computer vision problems, uncertainty can be divided into epistemic and aleatoric uncertainty. Epistemic uncertainty reflects the model’s uncertainty about the observed data, which is usually caused by inadequate training data; higher epistemic uncertainty therefore indicates lower reliability of the model’s prediction for a given image or in open-set conditions. Aleatoric uncertainty, on the other hand, reflects the uncertainty inherent in a given image, including noise and variance, and can be used to assess the quality of any dataset or image.
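One common way to estimate these two quantities for a Bayesian model is Monte Carlo sampling over stochastic forward passes, decomposing the predictive entropy into an expected-entropy (aleatoric) term and a mutual-information (epistemic) term. The sketch below illustrates that generic recipe for a classification output; it is not necessarily the exact formulation used in Bayes R-CNN.

```python
import torch

@torch.no_grad()
def mc_uncertainty(model, image, n_samples=20):
    """Monte Carlo estimate of epistemic/aleatoric uncertainty for a model whose
    weights are re-sampled on every forward pass (e.g., a BBB network).
    Assumes model(image) returns class logits of shape (N, C)."""
    probs = torch.stack([model(image).softmax(dim=-1) for _ in range(n_samples)])  # (T, N, C)
    mean_p = probs.mean(dim=0)
    total = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)                # predictive entropy
    aleatoric = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean(dim=0)  # expected entropy
    epistemic = total - aleatoric                                                # mutual information
    return epistemic, aleatoric
```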
Figure 16 shows the uncertainty estimates for several predicted images. It can be seen that the epistemic uncertainty is lower for images in which the model accurately detected the objects, while it increases for images where the model’s prediction score is very low. Likewise, the aleatoric uncertainty increases for images whose quality is degraded by inherent noise. The proposed method can therefore be used in real-world open-set conditions, where the model encounters data it was not trained on, and predictions on such data can be evaluated through uncertainty measurement to make more robust decisions.
6.6. Background Classification
The DIOR and HRSC2016 datasets encompass a variety of remote sensing images, each providing diverse background information, thereby serving as a rich source for data extraction and analysis. These datasets have been classified into four distinctive categories based on their respective background environments: Airport, City, Sea, and Suburb. A total of 2713 images were randomly selected from the datasets and systematically assigned to these four categories.
In our study, we utilized an AutoAugment policy learned on ImageNet to enhance the training dataset. Images processed with AutoAugment can be seen in Figure 17. This technique was applied to introduce diversity and improve the model’s robustness on the dataset. The impact of this data augmentation is quantitatively demonstrated in Table 7, which compares the overall classification accuracy of our model with and without data augmentation. Our model achieved a classification accuracy of 99.12% with AutoAugment and 98.45% without any data augmentation, demonstrating the effectiveness of AutoAugment in enhancing the model’s performance by providing a more varied and comprehensive training dataset.
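As a rough illustration, the ImageNet AutoAugment policy is available off the shelf in torchvision and could be wired into the training transform roughly as follows; the exact augmentation pipeline used in this study may differ.

```python
from torchvision import transforms

# ImageNet AutoAugment policy applied to each training image before tensor conversion.
train_transform = transforms.Compose([
    transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```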
Table 8 shows the validation results of MRENet for the different classes. This study used a learning rate of 0.01 in conjunction with the SGD optimizer to train the model. The trained model was subjected to testing, wherein the evaluation metrics for classification were computed to assess the performance of the model. Our model demonstrated exceptional accuracy across all categories, thereby validating its efficacy.
Our proposed background information model was subsequently integrated with the prediction layer of the Bayes R-CNN to generate object detection with background information.
Figure 18 illustrates the outcome of retrieving background information in conjunction with the object detection methodology. We selected four images representative of two distinct background classes to evaluate the predictive prowess of the proposed method. As depicted, the method exhibits impressive accuracy in predicting the respective background classes. This innovative approach holds substantial potential for future research, as it facilitates the extraction of an enriched array of information from remote sensing imagery.
7. Discussion
This study utilized two widely recognized state-of-the-art datasets, DIOR and HRSC2016, which are extensively used in the remote sensing community to benchmark object detection models. The DIOR dataset comprises high-resolution images with a diverse set of 20 object categories, providing a broad spectrum of scenarios and environmental conditions. This diversity is crucial for evaluating the robustness and generalizability of object detection models. The HRSC2016 dataset, on the other hand, focuses specifically on high-resolution images of ships, offering detailed annotations and challenging scenarios such as varying scales, orientations, and complex backgrounds. To establish the sufficiency of these datasets for training and testing our model, we compared the performance of our model with that of other state-of-the-art models.
Table 9 compares the mAP attained by the proposed Bayes R-CNN model to that of existing state-of-the-art models on the DIOR dataset. In this comparison, the Bayes R-CNN model outperforms all others, obtaining the highest mAP score of 74.6. Other models, including MSF-SNET, DFPN-YOLO, ASSD, and AFPN + GAS, obtained mAP scores of 66.5, 69.3, 71.1, and 73.3, respectively. Even compared to approaches such as A-MLFFM, ViT-G12X4, and FPN + MSDAM + MLFAM, which produced relatively high mAP scores in the range of 73.6 to 73.9, the Bayes R-CNN model performs better. This result demonstrates the Bayes R-CNN model’s efficacy for object detection tasks on the DIOR dataset.
Table 10 provides a comparative analysis of the mAP performance of the Bayes R-CNN model against other state-of-the-art models on the HRSC2016 dataset. The Bayes R-CNN model outperformed the other models, obtaining the highest mAP of 91.21. Compared to models with lower mAP scores ranging from 78.51 to 88.20, such as TOSO, Gliding Vertex, RSDet, and DAFNe, the Bayes R-CNN shows a considerable improvement. Even compared to high-performing models like CSL, R3Det, LSKNet-S, and OFCOS, which achieved mAP scores ranging from 89.62 to 91.07, the Bayes R-CNN performed better. These findings highlight the Bayes R-CNN model’s efficacy for object detection tasks on the HRSC2016 dataset.
During the evaluation of our background classification model, we observed several instances of misclassification that provide insight into the challenges faced by the model. Specifically, we identified one airport image that was incorrectly classified as a suburb, two suburb images that were mislabeled as a city, and one suburb image that was mistakenly categorized as an airport, as shown in Figure 19. These errors can be attributed to several factors related to the visual similarities between these classes and the model’s feature extraction process.
The airport image misclassified as a suburb resulted from the presence of vast open spaces near the airport, which confused the model. Similarly, the two suburb images were labeled as a city due to the higher density of buildings and urban-like features present within these suburbs, causing the model to categorize them as city environments. Furthermore, the suburb image misclassified as an airport likely resulted from the presence of airport-like structures that the model associated with airport facilities. This indicates that future research could focus on refining feature extraction techniques to better differentiate between such visually similar environments.
Although our proposed method achieved better performance, one notable limitation of our model is its computational complexity. The integration of Bayesian techniques such as BBB and BDLAM increases both the computational load and the time required for training compared to traditional deep learning models. This can be a significant challenge for real-time applications and necessitates access to high-performance computing resources. Additionally, while our model effectively quantifies uncertainty and mitigates overfitting, it still faces challenges in distinguishing objects with highly similar visual features, leading to occasional misclassifications. Furthermore, our model’s dependency on high-resolution images to achieve optimal performance presents another limitation. In practical applications, such high-quality data may not always be available due to constraints in sensor capabilities or transmission bandwidth. Although we have incorporated a Bayesian image super-resolution technique to address this issue, ensuring consistent and accurate object detection across various datasets remains a challenge.
Future research could focus on enhancing the computational efficiency of our model, possibly through techniques such as model pruning and more efficient variational inference methods. Additionally, implementing advanced data augmentation strategies and exploring semi-supervised learning approaches could improve the model’s generalization capabilities and reduce its dependency on high-resolution images. We also see significant potential in extending our model to process video sequences in remote sensing, leveraging temporal information to improve detection accuracy and robustness.