Technical Note

Region-Focusing Data Augmentation via Salient Region Activation and Bitplane Recombination for Target Detection

1 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
2 School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(24), 4806; https://doi.org/10.3390/rs16244806
Submission received: 26 October 2024 / Revised: 17 December 2024 / Accepted: 20 December 2024 / Published: 23 December 2024

Abstract

As the performance of a convolutional neural network is logarithmically proportional to the amount of training data, data augmentation has attracted increasing attention in recent years. Although the current data augmentation methods are efficient because they force the network to learn multiple parts of a given training image through occlusion or re-editing, most of them can damage the internal structures of targets and ultimately affect the results of subsequent application tasks. To this end, a region-focusing data augmentation method via salient region activation and bitplane recombination is proposed in this paper for the target detection of optical satellite images, to solve the problem of internal structure loss in data augmentation. More specifically, to boost the utilization of the positive regions and typical negative regions, a new surroundedness-based strategy for salient region activation is proposed, through which new samples with meaningful focusing regions can be generated. And to generate new samples of the focusing regions, a region-based strategy for bitplane recombination is also proposed, through which the internal structures of the focusing regions can be preserved. Thus, a multiplied data augmentation effect can be achieved by combining the two strategies. In addition, this is the first time that data augmentation has been examined from the perspective of meaningful focusing regions, rather than the whole sample image. Experiments on target detection with public datasets have demonstrated the effectiveness of the proposed method, especially for small targets.

1. Introduction

With the rise of ChatGPT and its variants, data-driven large foundation models have attracted increasing attention. A recent study has shown that the performance of a convolutional neural network (CNN) is logarithmically proportional to the amount of training data [1]. But in most cases, training data are difficult to collect or expensive to annotate, especially in the field of satellite remote sensing, e.g., landslide data [2], disaster emergency data [3] and hyperspectral images [4,5,6]. It is generally acknowledged that, without sufficient annotated training data, over-parameterized CNNs would be over-fitted.
To solve the over-fitting problem, data augmentation has attracted increasing attention in recent years. According to the actual effect on training, data augmentation methods can be roughly categorized into two groups: positive data augmentation and negative data augmentation. The former is dedicated to generating new augmented data through a finite or conscious distribution shift of the original data, and the latter is dedicated to increasing the robustness of the model by increasing data complexity. More specifically, in the group of positive data augmentation, geometric methods, including cropping, flipping [7] and rotation [8], multiply the geometric variation in the sample images. Color jitter [9] changes the contrast, saturation and brightness of the sample images. The bitplane recombination method [10] builds on the idea that there are certain differences between different bitplanes, and that recombining them can provide structural information at different levels. In addition, Metalantis [11] enhances underwater images through metamergence, metalief and metaebb phases. Inconsistent knowledge distillation (IKD) [12] utilizes sample-specific data augmentation to capture distinct frequency components and adversarial feature augmentation to extract non-robust features. The above methods can really boost the utilization of the original data with a finite distribution shift, but they pay no attention to the influence of sub-regions in the sample images, especially those that may contain positive or typical negative regions. Apart from these, mixup [13] forces CNNs to learn convex combinations of different sample images to generate new augmented samples, maximizing the margin by pushing the decision boundary away from the original samples. Manifold mixup [14] mixes the features from the intermediate layers of a CNN to generate new augmented samples with feature diversity. These methods also boost the utilization of the original data through a conscious distribution shift, but the newly generated samples are usually unrealistic, unreliable and inconsistent with human perception.
On the other hand, in the category of negative data augmentation, GANs have been successfully introduced to generate new samples, including BAGAN [15], DiffAugment [16] and the APA method [17]. However, most of them may be unfeasible for remotely sensed or medical images, as the newly generated samples are unrealistic and unreliable. In addition, Cutout [18] randomly replaces a certain part of a sample image with a square patch of constant value, to prevent the CNN from over-fitting. Random erasing [19] pushes the CNN to learn occlusion by randomly determining whether to mask a region or not. Hide-and-Seek [20] randomly hides image patches to force the CNN to learn multiple parts of the original image. GridMask [21] masks multiple regions in evenly spaced grids to maintain a good balance between the deletion and retention of critical information. CutMix [7] cuts a certain region from a sample image and replaces it with another, to differentiate two classes within a single image. RICAP [22] crops and patches four sample images to generate new samples. Even though these methods concentrate on forcing the CNN to learn multiple parts of the original image by occlusion or re-editing operations, they actually damage the internal structures of a given sample image, and ultimately affect the results of subsequent application tasks. Taking target detection in remotely sensed images as an example, a single target may consist of only a few pixels, while one image may contain thousands of such small targets. Most of these methods are inapplicable in this situation, as they may completely occlude or crop off such small targets. Furthermore, most of them concentrate only on the whole sample image, rather than the meaningful sub-regions inside it.
To this end, this paper proposes Region-fOcusing data augmentation via salient region activation and BITplane recombination for target detection, termed ROBIT, to solve the problem of internal structure loss in the current data augmentation methods. More specifically, to boost the utilization of the positive regions and typical negative regions, a new surroundedness-based strategy for the salient region activation is proposed. And to generate new samples of the focusing regions, a region-based strategy for the bitplane recombination is proposed. Experiments on target detection tasks demonstrated the effectiveness of this method. The novelties and contributions of the proposed method can be summarized as follows:
  • To the best of our knowledge, this is the first time that data augmentation has been examined from the perspective of meaningful focusing regions, rather than the whole sample image, and both positive regions and typical negative regions have been considered at the same time.
  • A region-based strategy for bitplane recombination is proposed, which can maintain the internal structures of the focusing regions. And through a combination with the region-focusing strategy, a multiplied rate of data augmentation can be achieved.

2. Related Works

2.1. Saliency Detection

In recent years, convolutional neural networks (CNNs) have succeeded in saliency detection; they are generally utilized to construct encoders and decoders for feature extraction and to generate visual saliency maps [23]. Subsequently, Long Short-Term Memory (LSTM) networks were also applied to capture both local and long-range visual features, to boost the accuracy of saliency detection [24]. Recently, by learning spatial long-range dependencies, transformer-based networks have achieved significant improvements in saliency detection [25]. However, most of these methods are designed for singular visual contexts and cannot be further adapted to different tasks [26].

2.2. Bitplane Techniques

On the other hand, to exploit the internal structural information of the sample image, we should pay attention to bitplane techniques. Khan et al. combined fuzzy logic with bitplanes to locate and segment the regions of interest (RoI) in Computed Tomography (CT) images [27]. Dubey et al. adopted the local bitplane decoded pattern (LBDP) to retrieve CT images [28]. Tuan et al. utilized adaptive fast marching and bitplanes to segment the brain in Magnetic Resonance Images (MRI) [29]. To date, bitplane techniques have attracted attention mostly from medical image processing, rather than from deep-learning-related tasks, especially data augmentation or target detection tasks.

3. Methods

This proposed region-focusing data augmentation method mainly consists of the following three parts: empirical risk of data augmentation, region-focusing based on salient region activation and region-based data augmentation.

3.1. Empirical Risk of Data Augmentation

Let $X$ and $Y$ denote the data space and label space, respectively. Each sample $x$ with label $y$ can be denoted by $(x, y)$, and the joint distribution of these data by $\mathcal{D}(x, y)$. The learning task aims to find a function $f: x \mapsto y$ that minimizes the pre-defined loss $l(f(x), y)$ over the distribution $\mathcal{D}$; this is known as the expected risk $R(f|\mathcal{D})$, defined by

$$R(f|\mathcal{D}) = \int l(f(x), y) \, d\mathcal{D}(x, y) \tag{1}$$

However, the joint distribution of the training data is usually unknown in practice. Thus, a general solution is empirical risk minimization (ERM) [30], which optimizes the empirical risk over the training dataset $\{(x_i, y_i)\}_{i=1}^{N}$ to mimic the data distribution:

$$\hat{R}(f|\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} l(f(x_i), y_i) \tag{2}$$
From Equation (2), we can see that the empirical risk $\hat{R}(f|\mathcal{D})$ approximates the expected risk more faithfully as the amount of training data grows. When conducting data augmentation, new samples $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{N'}$ can be generated by transformation, occlusion or re-editing of $\{(x_i, y_i)\}$. Since most data augmentation methods do not change the annotated labels, the newly generated samples $\{(\tilde{x}_i, \tilde{y}_i)\}$ share the same labels as $\{(x_i, y_i)\}$, i.e., $\tilde{y}_i = y_i$. Data augmentation converts the distribution $\mathcal{D}$ into a new distribution $\mathcal{D}'$, and the learning task changes from minimizing the empirical risk $\hat{R}(f|\mathcal{D})$ to minimizing $\hat{R}(f|\mathcal{D}')$:

$$\hat{R}(f|\mathcal{D}') = \frac{1}{N'} \sum_{i=1}^{N'} l(f(\tilde{x}_i), \tilde{y}_i) \tag{3}$$

where $N'$ denotes the total scale of the new augmented dataset, $N' \geq N$. According to the Vapnik–Chervonenkis Dimension theory (VC theory) [30], given a binary classifier $f \in \mathcal{F}$ with finite VC-Dimension $|\mathcal{F}|_{VC}$, the upper bound of the expected risk holds with probability $1 - \delta$:

$$R(f|\mathcal{D}) \leq \hat{R}(f|\mathcal{D}) + O\left(\left(\frac{|\mathcal{F}|_{VC} - \log \delta}{N}\right)^{\alpha}\right) \tag{4}$$

where $\frac{1}{2} \leq \alpha \leq 1$; $\alpha = \frac{1}{2}$ corresponds to the non-separable case and $\alpha = 1$ to the separable case [31]. Similarly, for the model trained on the new augmented dataset,

$$R(f|\mathcal{D}') \leq \hat{R}(f|\mathcal{D}') + O\left(\left(\frac{|\mathcal{F}|_{VC} - \log \delta}{N'}\right)^{\alpha}\right) \tag{5}$$

Although the generalization error term $O(\cdot)$ in Equation (5) is smaller than that in Equation (4), there is a discrepancy between the risk terms due to the distribution shift. Assuming $\mathcal{D}' \neq \mathcal{D}$, it can be expressed as

$$R(f|\mathcal{D}) - R(f|\mathcal{D}') = \epsilon_1 \neq 0 \tag{6}$$

Thus, Equation (5) can be reformulated as

$$R(f|\mathcal{D}) \leq \hat{R}(f|\mathcal{D}') + \epsilon_1 + O\left(\left(\frac{|\mathcal{F}|_{VC} - \log \delta}{N'}\right)^{\alpha}\right) \tag{7}$$
From Equation (7), it can be seen that the benefits of data augmentation can be attributed to (1) the distribution shift $\epsilon_1$ being fractional; and (2) the scale of the new augmented dataset being large, i.e., $N' \gg N$.
Actually, there is a certain trade-off between the amount of new augmented data and the distribution shift. The distribution shift between the original dataset and the new augmented dataset can be expressed as

$$\hat{R}(f|\mathcal{D}) - \hat{R}(f|\mathcal{D}') = \epsilon_2 \neq 0 \tag{8}$$

Combined with Equations (2) and (3), we have

$$\epsilon_2 = \frac{1}{N} \sum_{i=1}^{N} l(f(x_i), y_i) - \frac{1}{N'} \sum_{i=1}^{N'} l(f(\tilde{x}_i), \tilde{y}_i) \tag{9}$$
On the other hand, a series of data augmentation methods generate new augmented images by occlusion or re-editing of the original images, such as Cutout [18], RE [19] and HaS [20]. In these methods, the relationship between an augmented image and an original training image is one-to-one, i.e., each augmented image corresponds to exactly one original training image. Therefore, in these methods, $N'$ can be replaced by $N$:

$$\epsilon_2 = \frac{1}{N} \sum_{i=1}^{N} \left[ l(f(x_i), y_i) - l(f(\tilde{x}_i), y_i) \right] \tag{10}$$

Then, the loss can be approximated by a first-order Taylor expansion:

$$l(f(\tilde{x}_i), y_i) \approx l(f(x_i), y_i) + \nabla l(f(\tilde{x}_i), y_i)^T (\tilde{x}_i - x_i) \tag{11}$$

where $\nabla$ denotes the differential operation and $T$ the transpose. Combined with Equation (10), we have

$$\epsilon_2 = -\frac{1}{N} \sum_{i=1}^{N} \nabla l(f(\tilde{x}_i), y_i)^T (\tilde{x}_i - x_i) \tag{12}$$

where the term $(\tilde{x}_i - x_i)$ corresponds to the internal structure loss in these data augmentation methods, leading to the distribution shift. Meanwhile, for the latest data augmentation methods such as BIRD [10], since the transformation is conducted on certain bitplanes, the continuity of the internal structure of the training images can be maintained without changing the distribution of the original dataset.
Inspired by these methods, here, a region-focusing data augmentation method via salient region activation and bitplane recombination is proposed. More specifically, to boost the utilization of the positive regions and typical negative regions, a new surroundedness-based strategy for salient region activation is proposed. And to generate new samples of the focusing regions, a region-based strategy for bitplane recombination is proposed, without changing the distribution of the original dataset.

3.2. Region-Focusing Based on Salient Region Activation

Most of the current data augmentation methods are based on occlusion or re-editing operations; even though they concentrate on occlusion robustness or on forcing the CNN to learn multiple parts of the original sample image, these methods still damage the continuity of the internal structures of a given sample image. As a result, most of them may be inapplicable when the detection of small targets, as shown in Figure 1a, is the final application task. Intuitively, the occlusion or re-editing operations in those methods should be more conservative. Based on the above considerations, a new surroundedness-based strategy is proposed here, to focus the data augmentation operations only on some meaningful sub-regions rather than the whole sample image. In more detail, the meaningful sub-regions are extracted by the subsequent salient region activation, and bitplane recombination is then conducted on the extracted sub-regions.
It should be mentioned that, as a meaningful sub-region may or may not contain an annotated target, the sub-regions containing targets can be taken as positive regions, while those without targets can be taken as typical negative regions, as shown in Figure 2. If only the positive regions were selected for augmentation, the network would suffer from the long-tail problem. In contrast, our proposed surroundedness-based strategy, which uses both positive and typical negative regions, can alleviate this problem effectively, because both kinds of regions contribute to network training through the following data augmentation operation. A comparison between using all meaningful focusing regions and only the positive regions is given and discussed in Section 4.7.
To begin with, a single channel of a $C$-channel image can be taken as an 8-bit grayscale image $P_c \in \mathbb{R}^{H \times W \times 1}$. Our proposed salient region activation selects meaningful focusing regions $P_e$. To select them, a morphological approach, i.e., surroundedness, is introduced, which has demonstrated its effectiveness in sensing figure-ground assignments in neuroscience findings [32]. Surroundedness refers to the enclosure topological relationship between diverse visual components, which is invariant across scenarios and independent of the shape or scale of the visual components. Generally, to analyze surroundedness, a given image $P_{ori}$ is characterized by a set of binary images $\{B_r\}$ sampled from a pre-given distribution function. Then, to depict the effect of each binary image $B_r$, termed the salient map $\Psi(B_r)$, surroundedness analysis and normalization are conducted on each binary image. Finally, to acquire the meaningful focusing regions, the salient maps $\{\Psi(B_r)\}$ are averaged and thresholded, as shown in Figure 1.
Concretely, a set of binary images $\{B_r\}$ can be generated by randomly thresholding the feature maps of the given sample image $P_{ori}$, using a pre-defined threshold $\theta_c$ for each channel of the feature maps:

$$p_{i,j} = \Gamma(P_c, \theta_c), \quad \theta_c \sim d_\theta \tag{13}$$

where each individual channel image $P_c$ is selected as a feature map; $p_{i,j}$ represents the value of pixel $(i, j)$ on the binary image, i.e., $p_{i,j} = 1$ if $P_{c,i,j} > \theta_c$ and $p_{i,j} = 0$ otherwise; and $d_\theta$ represents the prior distribution from which the threshold $\theta_c$ is sampled.
To further simplify the sampling of a multi-channel image $P_{ori}$, the color space can be whitened before binary image sampling. The threshold $\theta_c$ can then be sampled at a fixed step $\delta$, which is set to 8 in our experiments; in the limit, this fixed-step sampling is equivalent to uniform sampling [32].
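To make the Boolean map sampling concrete, the following is a minimal NumPy sketch of the fixed-step thresholding in Equation (13); the function name and the default step are illustrative choices, not the paper's implementation.

```python
import numpy as np

def boolean_maps(channel: np.ndarray, step: int = 8) -> list[np.ndarray]:
    """Sample binary images {B_r} from one 8-bit channel (Eq. (13)):
    p_ij = 1 if P_c(i, j) > theta_c, else 0, with theta_c swept at a
    fixed step delta (equivalent to uniform sampling in the limit)."""
    # The color space is assumed to be whitened beforehand, as in the text.
    return [channel > theta for theta in range(0, 256, step)]
```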
To acquire the salient map $\Psi(B_r)$ of a binary image $B_r$, the surrounded areas are first analyzed to obtain the activation map $A(B_r)$, and the activation map is then normalized to emphasize the regions of small targets. Specifically, the pixels of a binary image $B_r$ can be divided into the figure set $B_{r,f}: \{p_{i,j} = 1\}$ and the ground set $B_{r,g}: \{p_{i,j} = 0\}$. Intuitively, a figure pixel is surrounded if it is enclosed by the ground set, and surroundedness can be defined by the connectivity of a pixel to the image boundary pixels: on a binary image $B_r$, a pixel $p_{i,j}$ is surrounded if there is no path in $B_{r,f}$ or $B_{r,g}$ that connects $p_{i,j}$ to any pixel of the image border, where a path denotes a sequence of pixels in which any pair of consecutive pixels are adjacent. Therefore, by using the pixels on the image boundary as seeds, non-surrounded pixels can be filtered out efficiently by the Flood Fill algorithm [32]. Thus, on the activation map $A(B_r)$, the surrounded pixels are set to 1 and the others to 0. Then, to normalize the activation map $A(B_r)$, each activation map is divided into two sub-activation maps:

$$A^+(B_r) = A(B_r) \wedge B_r, \quad A^-(B_r) = A(B_r) \wedge \neg B_r \tag{14}$$

where $\wedge$ is the pixel-wise Boolean conjunction and $\neg$ is negation. $A^+(B_r)$ and $A^-(B_r)$ denote the selected surrounded regions on $B_r$ and $\neg B_r$, respectively; intuitively, $A^+(B_r)$ activates the surrounded regions above the threshold, and $A^-(B_r)$ activates those below it. Normalization can therefore be applied to emphasize rare topographic features. To further adapt the sub-activation maps to small targets, they are dilated before normalization:

$$A^+(B_r) = A^+(B_r) \oplus K, \quad A^-(B_r) = A^-(B_r) \oplus K \tag{15}$$

where $\oplus$ is the dilation operation and $K$ is a square dilation kernel of width $w$. After that, to emphasize small active regions, $l_2$-normalization is applied to the sub-activation maps:

$$\Psi(B_r) = \frac{A^+(B_r)}{\|A^+(B_r)\|_2} + \frac{A^-(B_r)}{\|A^-(B_r)\|_2} \tag{16}$$
Finally, all salient maps are averaged to acquire the averaged salient map $\bar{\Psi}(B_r)$, in which small targets are emphasized, as shown in Figure 2.
The overall surroundedness-based strategy for salient region activation is simple, efficient and training-free. As a result, the center point $v_e = (i_e, j_e)$ and size $(h_e, w_e)$ of each meaningful focusing region $P_e$ can be acquired, as in Figure 1d, for the next part: region-based data augmentation.
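As a reference for the activation and normalization steps above (Equations (14)–(16)), here is a compact SciPy-based sketch; the connected-components formulation of the flood-fill criterion, the function names and the default kernel width are our own assumptions.

```python
import numpy as np
from scipy import ndimage

def activation_map(B: np.ndarray) -> np.ndarray:
    """A(B_r): a pixel is surrounded when its connected component, taken
    inside the figure set B_{r,f} or the ground set B_{r,g}, never reaches
    the image border (the flood-fill criterion with border seeds)."""
    surrounded = np.ones(B.shape, dtype=bool)
    for mask in (B, ~B):
        labels, _ = ndimage.label(mask)
        border = np.unique(np.concatenate(
            [labels[0], labels[-1], labels[:, 0], labels[:, -1]]))
        surrounded &= ~np.isin(labels, border[border > 0])
    return surrounded

def salient_map(B: np.ndarray, kernel_width: int = 3) -> np.ndarray:
    """Psi(B_r): split A(B_r) into A+ and A- (Eq. (14)), dilate each with a
    square kernel K (Eq. (15)), then l2-normalize and sum (Eq. (16))."""
    A = activation_map(B)
    K = np.ones((kernel_width, kernel_width), dtype=bool)
    psi = np.zeros(B.shape, dtype=np.float32)
    for sub in (A & B, A & ~B):              # A+(B_r) and A-(B_r)
        sub = ndimage.binary_dilation(sub, structure=K).astype(np.float32)
        norm = np.linalg.norm(sub)
        if norm > 0:
            psi += sub / norm
    return psi
```

Averaging salient_map(B) over the Boolean maps from the previous sketch and thresholding the result then yields the meaningful focusing regions.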

3.3. Region-Based Data Augmentation

After acquiring the meaningful focusing regions, a region-based strategy of bitplane recombination is conducted to generate new samples of the focusing regions, by which the internal structures of the focusing regions can be preserved.
Here, to establish the symbolic system, the basic bitplane recombination used in Algorithm 1 is briefly introduced first. Given a typical 8-bit $C$-channel image $P_{ori} \in \mathbb{R}^{H \times W \times C}$, it is first split into $C$ separate channels $P_c$:

$$P_{ori} = P_1 \cup P_2 \cup \cdots \cup P_C \tag{17}$$

where $P_c \in \mathbb{R}^{H \times W \times 1}$. Each channel can be taken as an 8-bit grayscale image, and each grayscale image can be sliced into 8 bitplanes hierarchically, from the 0th bitplane to the 7th bitplane, as follows:

$$P_c = \sum_{m=0}^{7} \sum_{i=1}^{H} \sum_{j=1}^{W} b_{i,j}^{m} \cdot 2^{m} \triangleq \{P_c^m\}_{m=[0,7]} \tag{18}$$

where $b_{i,j}^m$ represents the bit located at pixel $(i, j)$ on the $m$th bitplane (marked as the $m$th BP). Each bitplane $P_c^m \in \mathbb{R}^{H \times W \times 1}$ reserves only its own bits, the $m$th BP, taking no account of the bits on other bitplanes. For an 8-bit grayscale image, the bitplanes from the 0th BP to the 7th BP provide internal structure information from the details to the entirety.
After bitplane slicing, to take advantage of the bitplane properties in region-based data augmentation, a certain number of bitplanes is extracted to maintain the internal structures of the meaningful focusing regions. In detail, the extracted bitplanes can be formulated as

$$P_s = \sum_{c=1}^{C} \sum_{s=l_1}^{l_k} \sum_{i=1}^{H} \sum_{j=1}^{W} b_{c,i,j}^{s} \cdot 2^{s}, \quad s = \{l_1, \ldots, l_k\} \tag{19}$$

where $l_1, \ldots, l_k$ denote the orders of the extracted bitplanes, i.e., the $l_k$th BP.
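A minimal NumPy sketch of bitplane slicing (Equation (18)) and extraction-based recombination (Equation (19)) may help here; the function names are illustrative.

```python
import numpy as np

def slice_bitplanes(channel: np.ndarray) -> list[np.ndarray]:
    """Slice an 8-bit grayscale channel into its 8 bitplanes (Eq. (18));
    the m-th plane holds the bits b_{i,j}^m, weighted by 2^m on recombination."""
    return [((channel >> m) & 1).astype(np.uint8) for m in range(8)]

def recombine_bitplanes(bitplanes: list[np.ndarray],
                        selected: tuple[int, ...]) -> np.ndarray:
    """Recombine only the extracted bitplanes s = {l_1, ..., l_k} (Eq. (19));
    e.g. selected = (3, 4, 5, 6, 7) is one adjacent 5-BP extraction."""
    out = np.zeros_like(bitplanes[0])
    for m in selected:
        out |= bitplanes[m] << m   # planes carry disjoint bits, so OR equals the sum
    return out
```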
The appropriate number of extracted bitplanes depends on the difference between the recombined new image and the original. The recombined image $P_{new}^{l_1, \ldots, l_k}$ generated from the extracted bitplanes is utilized to estimate the original image using maximum a posteriori (MAP) theory [33], and the error probability between the original image $P_{ori}$ and the estimated image $\tilde{P}_{ori}$ is used to evaluate the effect of the number of extracted bitplanes:

$$\tilde{P}_{ori} = \arg\max \log p(P_{ori}, P_{new}^{l_1, \ldots, l_k}) \tag{20}$$
For a given 8-bit image, the averaged error probability is large when only 1 or 2 bitplanes are extracted, whereas it is acceptable when 3, 4, 5, 6 or 7 bitplanes are extracted. Thus, the suitable number of extracted bitplanes is in the range of 3 BPs to 7 BPs.
Since the recombined images $P_{new}^{l_1, \ldots, l_k}$ can be regarded as reconstructions of the original image $P_{ori}$, the Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) indicators are selected to measure the recombination quality. As a result, the MSE of $P_{new}^{0,1,2,3,4,5,6}$ and $P_{new}^{0,1,2,3,4,5,7}$ is quite large. In addition, the PSNR of $P_{new}^{0,1,2,3,4,5,6}$ and $P_{new}^{0,1,2,3,4,5,7}$ is so low that it is better to reserve both the 7th and the 6th bitplanes. Furthermore, the SSIM indicates that reserving both the 7th and the 6th bitplanes contributes to maintaining the major internal structures.
Therefore, from the viewpoint of recombination quality, bitplane extraction (BE) operations preserving both the 7th and the 6th bitplanes can be adopted to maintain the internal structural information. In addition, adjacent bitplane extraction [10] is selected as the basic strategy for the BE operation.
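For reference, the quality indicators used above can be sketched as follows; MSE and PSNR are written out directly, and SSIM can be taken from skimage.metrics.structural_similarity. The function names are illustrative.

```python
import numpy as np

def mse(p_ori: np.ndarray, p_new: np.ndarray) -> float:
    """Mean squared error between the original and recombined images."""
    diff = p_ori.astype(np.float64) - p_new.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(p_ori: np.ndarray, p_new: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means less structure loss."""
    e = mse(p_ori, p_new)
    return float("inf") if e == 0.0 else 10.0 * np.log10(peak ** 2 / e)
```

Under these indicators, extractions that keep both the 6th and 7th bitplanes score noticeably better, matching the observation above.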
For a focusing region on a single-channel image $P_c$, the bits within the selected focusing region $P_e$ on the extracted bitplanes are reserved according to Equation (19):

$$P_e = \sum_{s=l_1}^{l_k} \sum_{i=i_e}^{i_e + h_e} \sum_{j=j_e}^{j_e + w_e} b_{i,j}^{s} \cdot 2^{s} \tag{21}$$
For the $C$-channel image $P_{ori} \in \mathbb{R}^{H \times W \times C}$, there are 4 representative sub-policies $\{\Omega_i\}_{i=[1,4]}$ for the region-based data augmentation, taking $C = 3$ as an example:

1. The meaningful focusing regions are all selected, and the extracted bitplanes of different regions are the same-order bitplanes, as shown in Figure 3c:
   $$\Omega_1: P_{e_1} = P_{e_2} = P_{e_3}, \quad \{s_1\} = \{s_2\} = \{s_3\} \tag{22}$$
2. The meaningful focusing regions are partially selected, but the extracted bitplanes of different regions are the same-order bitplanes, as shown in Figure 3d:
   $$\Omega_2: P_{e_1} \neq P_{e_2} \neq P_{e_3}, \quad \{s_1\} = \{s_2\} = \{s_3\} \tag{23}$$
   It should be mentioned that, for a given training image, the partially selected focusing regions change by a pre-given percentage in each training epoch.
3. The meaningful focusing regions are all selected, but the extracted bitplanes of different regions are different-order bitplanes, as shown in Figure 3e:
   $$\Omega_3: P_{e_1} = P_{e_2} = P_{e_3}, \quad \{s_1\} \neq \{s_2\} \neq \{s_3\} \tag{24}$$
   Here, it should also be mentioned that "different-order bitplanes" refers not to the different individual bitplanes contained within one bitplane extraction operation, but to different bitplane extraction operations. Taking the 6-times 5-BPs as an example, whose adjacent bitplane extraction operations include $P_{new}^{3,4,5,6,7}$, $P_{new}^{0,4,5,6,7}$, $P_{new}^{2,3,5,6,7}$, $P_{new}^{1,2,5,6,7}$, $P_{new}^{0,1,5,6,7}$ and $P_{new}^{0,1,2,6,7}$, different-order bitplanes refer to $s = \{3,4,5,6,7\}$, $s = \{0,4,5,6,7\}$ or $s = \{2,3,5,6,7\}$, etc.
4. The meaningful focusing regions are partially selected, and the extracted bitplanes of different regions are different-order bitplanes, as shown in Figure 3f:
   $$\Omega_4: P_{e_1} \neq P_{e_2} \neq P_{e_3}, \quad \{s_1\} \neq \{s_2\} \neq \{s_3\} \tag{25}$$
By applying these region-focused bitplane recombination operations, the positive regions and typical negative regions within the meaningful focusing regions of the given sample images can be exploited effectively. Among these operations, sub-policy $\Omega_2$ emphasizes the spatial variation, while sub-policy $\Omega_3$ emphasizes the bitplane variation; sub-policy $\Omega_4$ combines both kinds of variation. The performances of these sub-policies are illustrated in Figure 3, and a selection sketch is given below.
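The sub-policy selection mentioned above can be sketched as follows; the pairing of regions with extraction operations is our own illustrative formulation, with `bp_ops` holding adjacent bitplane extraction operations such as (3, 4, 5, 6, 7).

```python
import random

def pair_regions_with_ops(policy: int, regions: list, bp_ops: list,
                          keep: float = 0.5) -> list:
    """Pair each focusing region with a bitplane extraction operation
    under sub-policy Omega_{policy}, policy in {1, 2, 3, 4}."""
    if policy in (2, 4):   # Omega_2 / Omega_4: partial region selection,
                           # re-drawn by the pre-given percentage each epoch
        regions = random.sample(regions, max(1, int(keep * len(regions))))
    if policy in (1, 2):   # Omega_1 / Omega_2: one shared same-order operation
        op = random.choice(bp_ops)
        return [(region, op) for region in regions]
    # Omega_3 / Omega_4: a different-order operation per region
    return [(region, random.choice(bp_ops)) for region in regions]
```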

3.4. Total Framework of the Proposed Method

After salient region activation and region-based bitplane extraction, the extracted bitplanes within the meaningful focusing regions are recombined to generate a new training sample $P_{new}^{l_1, \ldots, l_k} \in \mathbb{R}^{H \times W \times C}$, as follows:

$$P_{new}^{l_1, \ldots, l_k} = \neg\{P_e\} \cup P_{e_1} \cup P_{e_2} \cup \cdots \cup P_{e_n} \tag{26}$$

It can be seen from Equation (26) that, within the meaningful focusing regions of the given sample image, the internal structures are completely maintained, while the variety among different bitplanes can also be fully expressed through the recombination of the extracted bitplanes.
We can see that, given a selected region $P_e$ in the sample image $P_{ori}$, compared with the occlusion- or re-editing-based data augmentation methods, our proposed method reduces the internal structure loss from all $|P_e|$ pixels down to $|P_e| \times p \times \frac{\sum_r b^r \cdot 2^r}{255}$ with $r \notin \{l_1, \ldots, l_k\}$, i.e., only the bits on the non-extracted bitplanes are affected.
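A sketch of this region-focused recombination (Equations (21) and (26)) is given below; representing a region by its top-left corner and size, and collapsing the per-bit sum into a bitmask, are our own implementation choices.

```python
import numpy as np

def region_focused_recombine(image: np.ndarray, regions: list,
                             selected: tuple[int, ...]) -> np.ndarray:
    """Keep the image intact outside the focusing regions; inside each
    region, reserve only the bits on the extracted bitplanes (the bitmask
    is equivalent to the per-bit sum in Eq. (21))."""
    out = image.copy()                                 # expects a uint8 image
    mask = np.uint8(sum(1 << m for m in selected))     # e.g. {6, 7} -> 0b11000000
    for (i, j, h, w) in regions:                       # top-left corner + size
        out[i:i + h, j:j + w] &= mask
    return out
```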
The overall framework of our proposed ROBIT method is summarized in Algorithm 1 below and shown in Figure 4.
Algorithm 1. ROBIT data augmentation
Input: Training dataset $P_{ori} = \{P_1, P_2, \ldots, P_N\}$.
Parameter setting: $N_b$, the number of image blocks of image $P_i$;
          $p$, the probability threshold of the data augmentation.
Output: Augmented set $P_{new} = \{P_{new,1}, \ldots, P_{new,N}\}$.
For $i = 1$ to $N$ do
  Crop $P_i$ into image blocks $\{P_{i,j}\}_{j=[1,N_b]}$.
  For $j = 1$ to $N_b$ do
    For each epoch do
      Generate the averaged salient map;
      Extract the focusing regions $\{P_e\}$;
      For each focusing region do
        Randomly draw the augmentation probability $\tau$ from $(0, 1)$.
        If $\tau \leq p$ then
          Focus the region based on salient region activation;
          Apply region-focused bitplane extraction as in Equation (21);
          Recombine the extracted bitplanes to generate the new image as in Equation (26).
        End
      End
    End
  End
End
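For orientation, the loop of Algorithm 1 can be condensed into the following sketch, reusing boolean_maps, salient_map and region_focused_recombine from the earlier sketches; the extract_regions helper with its mean-plus-std thresholding rule is an assumed stand-in for the paper's thresholding, a single extraction operation per image is drawn (the $\Omega_2$-style pairing; pair_regions_with_ops above can be substituted for the other sub-policies), and block cropping and the per-epoch loop are omitted for brevity.

```python
import random
import numpy as np
from scipy import ndimage

def extract_regions(avg_map: np.ndarray) -> list:
    """Threshold the averaged salient map and return (i, j, h, w) boxes;
    the mean + std rule is an assumed stand-in for the paper's threshold."""
    labels, _ = ndimage.label(avg_map > avg_map.mean() + avg_map.std())
    return [(s[0].start, s[1].start, s[0].stop - s[0].start,
             s[1].stop - s[1].start) for s in ndimage.find_objects(labels)]

def robit_augment(images: list, p: float = 0.5,
                  bp_ops=((3, 4, 5, 6, 7), (0, 4, 5, 6, 7), (2, 3, 5, 6, 7),
                          (1, 2, 5, 6, 7), (0, 1, 5, 6, 7), (0, 1, 2, 6, 7))):
    """Condensed ROBIT loop: salient regions, then region-focused recombination."""
    augmented = []
    for img in images:
        gray = img.mean(axis=-1).astype(np.uint8) if img.ndim == 3 else img
        maps = [salient_map(B) for B in boolean_maps(gray)]
        regions = extract_regions(np.mean(maps, axis=0))
        # keep each focusing region with probability p (tau <= p in Algorithm 1)
        regions = [r for r in regions if random.random() <= p]
        augmented.append(region_focused_recombine(img, regions,
                                                  random.choice(bp_ops)))
    return augmented
```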

4. Results

To evaluate the effectiveness of our proposed method, experiments were conducted on the object detection task using two different public datasets.

4.1. Dataset

For the object detection task, our proposed method was evaluated on a public remotely sensed image dataset, the DOTA dataset [8], which is the most popular aerial remote sensing object detection dataset, containing 2806 images, 15 categories and 188,282 instances of different scales, orientations and shapes. This dataset is also representative for small target detection.
In addition, experiments were also conducted on the HRSC2016 dataset [34], which is designed for ship detection in remotely sensed images. HRSC2016 was collected from Google Earth and includes 1061 images ranging from 300 × 300 to 1500 × 900 pixels, with 436 training images, 181 validation images and 444 test images.

4.2. Experimental Settings of Our Proposed Method

In all the following experiments, the data augmentation rate of the bitplane recombination is set to six times, defined as the scale of the newly generated dataset relative to that of the original dataset. In addition, the percentage of region selection in region-focusing is set to 0.5 and the number of training epochs to 12 for our proposed method. Since the final data augmentation rate of our proposed ROBIT method equals the number of selected focusing regions times the rate of region-based data augmentation, the final rate is $1 \times 6 = 6$ times for sub-policy $\Omega_1$, $0.5 \times 12 \times 6 = 36$ times for sub-policy $\Omega_2$, $1 \times 6 = 6$ times for sub-policy $\Omega_3$ and $0.5 \times 12 \times 6 = 36$ times for sub-policy $\Omega_4$.

4.3. Comparison with Different Object Detection Methods

In this section, the target detection results of the RoI Transformer (RT) method [35] with and without our proposed method are compared with other popular target detection methods on the DOTA dataset, to validate the effectiveness of our proposed method. The RT method is the official baseline method of the DOTA dataset, and the compared methods include RRPN [36], DCN [37], ICN [38] and FFA-3 [39], which are influential works in the target detection of remotely sensed images. In this experiment, according to the settings of RT [35], 25% of the training images are augmented, the learning rate is set to 0.0005 and the momentum is set to 0.9. The mean average precision (mAP) on the DOTA server is utilized to evaluate the detection accuracy.
By using all the original and augmented images, the mAP over the 15 categories is shown in Table 1. From left to right, the classes are plane, baseball diamond, bridge, ground track field, small vehicle, large vehicle, ship, tennis court, basketball court, storage tank, soccer ball field, roundabout, harbor, swimming pool and helicopter. The best results are shown in bold. As can be seen in Table 1, the mAP is improved by 2.9% compared with the latest Pytorch version of the RT method (RT-Pytorch), and our proposed method even enhances the mAP by 7.1% compared with the Mxnet version (RT-Mxnet). It is also evident that, while there are subtle discrepancies in the accuracy of individual categories, our proposed method boosts most categories, especially those with small objects, such as the helicopter (HC) category.

4.4. Comparison with Different Data Augmentation Methods

In this section, data augmentation methods for target detection are compared on the DOTA dataset, including our proposed method, the traditional "Crop and Rotation (C&R)" [35] and "Crop and Scaling (C&S)" methods, and the latest BIRD method [10]. In addition, we also implemented a random digging-based BIRD data augmentation method (denoted rBIRD) as a comparison. For a fair comparison, all the input images were cropped to the same size. The official baseline method of the DOTA dataset, i.e., RT-Pytorch, was chosen as the benchmark target detection method, and the detection accuracy was again evaluated by the mAP. From Table 2, it can be seen that, compared with the baseline RT-Pytorch method, the mAP is enhanced by 2.9% using our proposed method. Moreover, our proposed ROBIT method also surpasses most of the other comparative methods and the random digging-based BIRD method across different categories, especially those with small targets, such as the helicopter (HC) category.
Given a $C$-channel original image $P_{ori} \in \mathbb{R}^{H \times W \times C}$, with the height and width of a meaningful focusing region $P_e$ denoted as $h_e$ and $w_e$, the computational complexity of single-stream ROBIT data augmentation is $O(n \times h_e \times w_e)$, where $n$ denotes the number of meaningful focusing regions. Therefore, the computational complexity for a given training image, $O(n \times h_e \times w_e)$, approximates $O(H \times W)$, and the complexity over the entire training dataset is $O(N \times H \times W)$, which is similar to that of the official data augmentation of the DOTA dataset, i.e., C&R and C&S.

4.5. Effectiveness of the Sub-Policies in Region-Based Operations

In this section, to investigate the effectiveness of the sub-policies of the region-based data augmentation operations for $C$-channel sample images, target detection experiments are carried out on the DOTA dataset. RT-Pytorch is again chosen as the benchmark detection method, and the mAP is applied to verify the effectiveness of the different sub-policies. The bitplane extraction operation is selected as 7-BPs.
From Table 3, it can be seen that both sub-policy $\Omega_2$ (partial regions selected, same-order bitplanes extracted) and sub-policy $\Omega_3$ (all regions selected, different-order bitplanes extracted) boost the target detection accuracy remarkably. What is more, the sub-policy with partial regions selected and different-order bitplanes extracted, i.e., sub-policy $\Omega_4$, combines the advantages of both and achieves the best performance.

4.6. Ablation Study

In this section, to validate the effectiveness of each part of our proposed method, ablation studies are conducted on the DOTA dataset. The official baseline method of the DOTA dataset, i.e., RT-Pytorch, is again chosen as the benchmark target detection method, and the effectiveness is evaluated by the mAP as well. The experimental results are shown in Table 4.
It can be seen that each part of our proposed data augmentation method can really boost the performance of target detection, and their combination gives full play to the advantages of the two individual parts, so the final mAP is remarkably enhanced by 2.9% compared with the baseline RT-Pytorch method.

4.7. Meaningful Focusing Regions vs. Only Positive Regions

In this section, a comparison between the use of all the meaningful focusing regions and only the positive regions for data augmentation is given. Object detection for typical small ship targets in optical remotely sensed images is selected as the comparison task, following [43]. And to accurately depict the positive regions, HeatNet [44] is selected as the baseline detection method.
The detection results are shown in Table 5. It can be clearly seen that the result of using all the meaningful focusing regions surpasses that of using only the positive regions, which validates the effectiveness of using all the meaningful focusing regions for data augmentation operations.

4.8. Comparison on the HRSC2016 Dataset

In this section, our proposed ROBIT method is also validated on the HRSC2016 dataset [34]. And the ReDet method [45] is selected as the baseline detection method to verify the effectiveness of our proposed method across different deep learning models. The experimental results are shown in Table 6. We can see that the detection accuracy can be improved by 1.3% using our proposed ROBIT method.

5. Discussion

As mentioned in Section 3.4, given a $C$-channel original image $P_{ori} \in \mathbb{R}^{H \times W \times C}$ with the height and width of a meaningful focusing region $P_e$ denoted as $h_e$ and $w_e$, the computational complexity of single-stream ROBIT data augmentation is $O(n \times h_e \times w_e)$, where $n$ denotes the number of meaningful focusing regions. Therefore, the complexity for a given training image approximates $O(H \times W)$, and that over the entire training dataset is $O(N \times H \times W)$. Compared with the training of the target detection network, the runtime of our proposed data augmentation is negligible. However, when there are very many targets in a remote sensing image, for example thousands of targets in one typical image, the computational complexity and runtime will increase accordingly.

6. Conclusions

To solve the problem of internal structure loss in data augmentation, this paper proposes a region-focusing data augmentation method via salient region activation and bitplane recombination. More specifically, to boost the utilization of the positive regions and typical negative regions, a new surroundedness-based strategy for salient region activation is proposed, by which new samples with meaningful focusing regions can be generated. And to generate new samples of the focusing regions, a region-based strategy for bitplane recombination is conducted, by which the internal structures can be preserved. Thus, a multiplied data augmentation effect can be acquired by combining the advantages of these two strategies. Finally, experiments on target detection have demonstrated the effectiveness of the proposed ROBIT method, especially for small targets. Although we have only applied our proposed method to target detection in multi-spectral images, it can also be used for other related tasks with different image sources, such as classification in multi-spectral images, spectral target detection in hyperspectral images and even lesion detection in medical images; these are directions for future work.

Author Contributions

Methodology, H.Z.; Investigation, X.H.; Writing—original draft, H.Z.; Writing—review and editing, W.S.; Supervision, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the National Natural Science Foundation (41971294, 82471999), the Beijing Institute of Technology Research Fund Program for Young Scholars, and the Cross-Media Intelligent Technology Project of BNRist (BNR2019TD01022) of China.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 843–852. [Google Scholar]
  2. Xie, Y.; Zhan, N.; Zhu, J.; Xu, B.; Chen, H.; Mao, W.; Luo, X.; Hu, Y. Landslide Extraction from Aerial Imagery Considering Context Association Characteristics. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103950. [Google Scholar] [CrossRef]
  3. Zhu, J.; Zhang, J.; Chen, H.; Xie, Y.; Gu, H.; Lian, H. A Cross-View Intelligent Person Search Method Based on Multi-Feature Constraints. Int. J. Digit. Earth 2024, 17, 2346259. [Google Scholar] [CrossRef]
  4. Xu, Y.; Hou, J.; Zhu, X.; Wang, C.; Shi, H.; Wang, J.; Li, Y.; Ren, P. Hyperspectral Image Super-Resolution with ConvLSTM Skip-Connections. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5519016. [Google Scholar] [CrossRef]
  5. Cao, S.; Feng, D.; Liu, S.; Xu, W.; Chen, H.; Xie, Y.; Zhang, H.; Pirasteh, S.; Zhu, J. BEMRF-Net: Boundary Enhancement and Multiscale Refinement Fusion for Building Extraction from Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16342–16358. [Google Scholar] [CrossRef]
  6. Zhang, H.; Han, X.; Deng, J.; Sun, W. How to Evaluate and Remove the Weakened Bands in Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024. [Google Scholar]
  7. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
  8. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  10. Zhang, H.; Xu, Z.; Han, X.; Sun, W. Data Augmentation using Bitplane Information Recombination Model. IEEE Trans. Image Process. 2022, 31, 3713–3725. [Google Scholar] [CrossRef]
  11. Wang, H.; Zhang, W.; Bai, L.; Ren, P. Metalantis: A Comprehensive Underwater Image Enhancement Framework. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5618319. [Google Scholar] [CrossRef]
  12. Liang, J.; Liang, S.; Liu, A.; Ma, K.; Li, J.; Cao, X. Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar]
  13. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. Int. Conf. Learn. Represent. 2018. [Google Scholar]
  14. Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; Bengio, Y. Manifold mixup: Better Representations by Interpolating Hidden States. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6438–6447. [Google Scholar]
  15. Mariani, G.; Scheidegger, F.; Istrate, R.; Bekas, C.; Malossi, C. Bagan: Data Augmentation with Balancing GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  16. Zhao, S.; Liu, Z.; Lin, J.; Zhu, J.Y.; Han, S. Differentiable Augmentation for Data-efficient GAN Training. Adv. Neural Inf. Process. Syst. 2020, 33, 7559–7570. [Google Scholar]
  17. Jiang, L.; Dai, B.; Wu, W.; Loy, C.C. Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data. Adv. Neural Inf. Process. Syst. 2021, 34, 21655–21667. [Google Scholar]
  18. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  19. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. AAAI Conf. Artif. Intell. 2020, 34, 13001–13008. [Google Scholar] [CrossRef]
  20. Kumar Singh, K.; Jae Lee, Y. Hide-and-seek: Forcing a Network to Be Meticulous for Weakly-supervised Object and Action Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3524–3533. [Google Scholar]
  21. Chen, P.; Liu, S.; Zhao, H.; Wang, X.; Jia, J. Gridmask Data Augmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  22. Takahashi, R.; Matsubara, T.; Uehara, K. Ricap: Random Image Cropping and Patching Data Augmentation for Deep CNNs. In Proceedings of the Asian Conference on Machine Learning, PMLR, Beijing, China, 14–16 November 2018; pp. 786–798. [Google Scholar]
  23. Droste, R.; Jiao, J.; Noble, J. Unified Image and Video Saliency Modeling. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Volume 16, pp. 419–435. [Google Scholar]
  24. Liu, N.; Han, J. A Deep Spatial Contextual Longterm Recurrent Convolutional Network for Saliency Detection. IEEE Trans. Image Process. 2018, 27, 3264–3274. [Google Scholar] [CrossRef] [PubMed]
  25. Djilali, Y.A.D.; McGuinness, K.; O’Connor, N. Learning Saliency from Fixations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 383–393. [Google Scholar]
  26. Hosseini, A.; Kazerouni, A.; Akhavan, S.; Brudno, M.; Taati, B. SUM: Saliency Unification through Mamba for Visual Attention Modeling. arXiv 2024, arXiv:2406.17815. [Google Scholar]
  27. Khan, Z.F.; Kannan, A. Intelligent Segmentation of Medical Images using Fuzzy Bitplane Thresholding. Meas. Sci. Rev. 2014, 14, 94–101. [Google Scholar] [CrossRef]
  28. Dubey, S.R.; Singh, S.K.; Singh, R.K. Local Bit-plane Decoded Pattern: A Novel Feature Descriptor for Biomedical Image Retrieval. IEEE J. Biomed. Health Inform. 2015, 20, 1139–1147. [Google Scholar] [CrossRef]
  29. Tuan, T.A.; Kim, J.Y.; Bao, P.T. 3D Brain Magnetic Resonance Imaging Segmentation by Using Bitplane and Adaptive Fast Marching. Int. J. Imaging Syst. Technol. 2018, 28, 223–230. [Google Scholar] [CrossRef]
  30. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998. [Google Scholar]
  31. He, Z.; Xie, L.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Data Augmentation Revisited: Rethinking the Distribution Gap between Clean and Augmented Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  32. Zhang, J.; Sclaroff, S. Exploiting Surroundedness for Saliency Detection: A Boolean Map Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 889–902. [Google Scholar] [CrossRef] [PubMed]
  33. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  34. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A High Resolution Optical Satellite Image Dataset for Ship Recognition and Some New Baselines. Int. Conf. Pattern Recognit. Appl. Methods 2017, 2, 324–331. [Google Scholar]
  35. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  36. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented Scene Text Detection via Rotation Proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
  37. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  38. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery. In Asian Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2018; pp. 150–165. [Google Scholar]
  39. Fu, K.; Chang, Z.; Zhang, Y.; Xu, G.; Zhang, K.; Sun, X. Rotation-aware and Multi-scale Convolutional Neural Network for Object Detection in Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 161, 294–308. [Google Scholar] [CrossRef]
  40. Yang, X.; Sun, H.; Sun, X.; Yan, M.; Guo, Z.; Fu, K. Position Detection and Direction Prediction for Arbitrary-oriented Ships via Multiscale Rotation Region Convolutional Neural Network. IEEE Access 2018, 6, 50839–50849. [Google Scholar] [CrossRef]
  41. Ding, J.; Xue, N.; Xia, G.-S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef] [PubMed]
  42. Zhang, H.; Leng, W.; Han, X.; Sun, W. MOON: A Subspace-Based Multi-Branch Network for Object Detection in Remotely Sensed Images. Remote Sens. 2023, 15, 4201. [Google Scholar] [CrossRef]
  43. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented Object Detection in Aerial Images with Box Boundary-aware Vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2150–2159. [Google Scholar]
  44. Zhang, H.; Xu, Z.; Han, X.; Sun, W. Refining FFT-based Heatmap for the Detection of Cluster Distributed Targets in Satellite Images. In Proceedings of the British Machine Vision Conference, Online, 22–25 November 2021. [Google Scholar]
  45. Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A Rotation-equivariant Detector for Aerial Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2786–2795. [Google Scholar]
  46. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined Single-stage Detector with Feature Refinement for Rotating Object. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3163–3171. [Google Scholar] [CrossRef]
Figure 1. The extraction of meaningful focusing regions on a sample image which contains numerous small targets.
Figure 2. The diagram of emphasized small targets on the averaged salient map. (a) Original image which contains small targets, where only positive regions are marked with yellow boxes. (b) Averaged salient map which emphasizes all meaningful regions, where typical negative regions are marked with green boxes.
Figure 3. Region-focused data augmentation with different sub-policies. (a) Original image. (b) Averaged salient map of the original image. (c) Region-focused data augmentation with the $\Omega_1$ sub-policy. (d) Region-focused data augmentation with the $\Omega_2$ sub-policy. (e) Region-focused data augmentation with the $\Omega_3$ sub-policy. (f) Region-focused data augmentation with the $\Omega_4$ sub-policy.
Figure 4. Total framework of our proposed method.
Table 1. Comparison results of different object detection methods on the DOTA server.

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RRPN [36] | 80.9 | 65.8 | 35.3 | 67.4 | 59.9 | 50.9 | 55.8 | 90.7 | 66.9 | 72.4 | 55.1 | 52.2 | 55.1 | 53.4 | 48.2 | 61.0 |
| Yang et al. [40] | 81.3 | 71.4 | 36.5 | 67.4 | 61.2 | 50.9 | 56.6 | 90.7 | 68.1 | 72.4 | 55.1 | 55.6 | 62.4 | 53.4 | 51.5 | 62.3 |
| DCN [37] | 85.0 | 75.7 | 31.6 | 73.4 | 58.9 | 46.1 | 60.7 | 89.6 | 75.2 | 76.8 | 52.3 | 61.7 | 47.6 | 58.7 | 44.0 | 62.5 |
| ICN [38] | 81.4 | 74.3 | 47.7 | 70.3 | 64.9 | 67.8 | 70.0 | **90.8** | 79.1 | 78.2 | 53.6 | 62.9 | 67.0 | 64.2 | 50.2 | 68.2 |
| FFA-3 [39] | 88.8 | 74.4 | 48.8 | 57.9 | 63.6 | 75.9 | 79.6 | **90.8** | 80.3 | 82.9 | 54.3 | 60.0 | 66.9 | 66.8 | 42.5 | 68.9 |
| RT-Mxnet [35] | 88.6 | 78.5 | 43.4 | 75.9 | 68.8 | 73.7 | 83.6 | 90.7 | 77.3 | 81.5 | 58.4 | 53.5 | 62.8 | 58.9 | 47.7 | 69.6 |
| RT-Pytorch [41] | 88.0 | 76.1 | 52.6 | 72.1 | 78.0 | 77.8 | 87.3 | 90.4 | 84.7 | 82.7 | 53.9 | 62.7 | 75.4 | 68.2 | 56.4 | 73.8 |
| RT-Pytorch + GHM [42] | 88.7 | 77.4 | 53.9 | **77.4** | 77.6 | 77.6 | **87.7** | **90.8** | **86.8** | **85.6** | 61.9 | 60.1 | 76.1 | 70.5 | **64.3** | 75.8 |
| RT-Pytorch + ROBIT | **89.2** | **84.1** | **54.8** | 76.0 | **78.4** | **82.8** | **87.7** | **90.8** | 84.5 | 85.4 | **65.6** | **63.2** | **77.1** | **71.6** | 59.5 | **76.7** |
Table 2. Comparison results of different augmentation methods on the DOTA server.

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RT-Pytorch | 88.0 | 76.1 | 52.6 | 72.1 | 78.0 | 77.8 | 87.3 | 90.4 | 84.7 | 82.7 | 53.9 | 62.7 | 75.4 | 68.2 | 56.4 | 73.8 |
| RT-Pytorch + C&R | 89.4 | 78.2 | 54.3 | 72.7 | 72.4 | 76.5 | 87.7 | 90.8 | 80.5 | 85.7 | 60.1 | 61.9 | 76.0 | 72.4 | 58.7 | 74.5 |
| RT-Pytorch + C&S | 88.9 | 76.4 | 54.3 | 76.8 | 73.4 | 76.7 | 87.4 | 90.8 | 86.9 | 78.7 | 64.7 | 63.0 | 75.8 | 71.1 | 55.2 | 74.7 |
| RT-Pytorch + BIRD | 88.9 | 78.6 | 52.7 | 75.7 | 71.8 | 76.9 | 87.6 | 90.8 | 85.6 | 84.2 | 63.4 | 62.3 | 77.0 | 70.8 | 55.1 | 74.8 |
| RT-Pytorch + rBIRD | 88.9 | 82.6 | 53.0 | 77.1 | 73.1 | 77.1 | 87.5 | 90.8 | 86.3 | 84.4 | 62.7 | 61.3 | 74.8 | 71.1 | 58.4 | 75.3 |
| RT-Pytorch + IKD | 89.2 | 83.2 | 53.7 | 75.5 | 77.7 | 82.2 | 87.6 | 90.7 | 86.9 | 84.9 | 61.9 | 62.8 | 77.1 | 71.7 | 58.7 | 76.2 |
| RT-Pytorch + ROBIT | 89.2 | 84.1 | 54.8 | 76.0 | 78.4 | 82.8 | 87.7 | 90.8 | 84.5 | 85.4 | 65.6 | 63.2 | 77.1 | 71.6 | 59.5 | 76.7 |
Table 3. Experiments on the sub-policies for region-based data augmentation.

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RT-Pytorch | 88.0 | 76.1 | 52.6 | 72.1 | 78.0 | 77.8 | 87.3 | 90.4 | 84.7 | 82.7 | 53.9 | 62.7 | 75.4 | 68.2 | 56.4 | 73.8 |
| RT-Pytorch + $\Omega_1$ | 88.9 | 78.2 | 54.9 | 75.7 | 78.1 | 77.9 | 87.5 | 90.9 | 86.3 | 85.4 | 61.0 | 65.5 | 77.2 | 71.8 | 58.0 | 75.8 |
| RT-Pytorch + $\Omega_2$ | 89.1 | 83.3 | 54.6 | 74.3 | 78.3 | 77.7 | 87.8 | 90.8 | 87.0 | 85.7 | 67.2 | 62.2 | 77.1 | 70.9 | 63.6 | 76.6 |
| RT-Pytorch + $\Omega_3$ | 89.3 | 84.7 | 54.7 | 76.7 | 78.4 | 82.7 | 87.6 | 90.7 | 87.1 | 85.5 | 64.9 | 62.2 | 76.9 | 71.5 | 56.6 | 76.6 |
| RT-Pytorch + $\Omega_4$ | 89.2 | 84.1 | 54.8 | 76.0 | 78.4 | 82.8 | 87.7 | 90.8 | 84.5 | 85.4 | 65.6 | 63.2 | 77.1 | 71.6 | 59.5 | 76.7 |
Table 4. Ablation study on the DOTA server, where bitplane recombination is termed BR and region focusing is termed RF.

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RT-Pytorch | 88.0 | 76.1 | 52.6 | 72.1 | 78.0 | 77.8 | 87.3 | 90.4 | 84.7 | 82.7 | 53.9 | 62.7 | 75.4 | 68.2 | 56.4 | 73.8 |
| RT-Pytorch + BR | 88.9 | 78.6 | 52.7 | 75.7 | 71.8 | 76.9 | 87.6 | 90.8 | 85.6 | 84.2 | 63.4 | 62.3 | 77.0 | 70.8 | 55.1 | 74.8 |
| RT-Pytorch + RF | 88.9 | 78.0 | 54.6 | 75.3 | 73.7 | 77.7 | 87.5 | 90.8 | 86.5 | 85.6 | 60.1 | 61.6 | 76.8 | 69.7 | 59.2 | 75.1 |
| RT-Pytorch + ROBIT | 89.2 | 84.1 | 54.8 | 76.0 | 78.4 | 82.8 | 87.7 | 90.8 | 84.5 | 85.4 | 65.6 | 63.2 | 77.1 | 71.6 | 59.5 | 76.7 |
Table 5. Comparison results on all regions and only positive regions.

| Method | mAP |
|---|---|
| HeatNet | 74.8 |
| HeatNet + only positive samples | 75.2 |
| HeatNet + ROBIT | 75.5 |
Table 6. Comparison results on the HRSC2016 dataset.

| Method | mAP |
|---|---|
| RC2 [34] | 75.7 |
| RRPN [36] | 79.1 |
| RT [35] | 80.1 |
| BBAvector [43] | 82.8 |
| HeatNet [44] | 84.7 |
| R3Det [46] | 89.3 |
| ReDet [45] | 96.4 |
| ReDet + ROBIT | 97.7 |