Article

Lightweight Neural Network for Centroid Detection of Weak, Small Infrared Targets via Background Matching in Complex Scenes

1 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(22), 4301; https://doi.org/10.3390/rs16224301
Submission received: 19 June 2024 / Revised: 24 July 2024 / Accepted: 14 November 2024 / Published: 18 November 2024
(This article belongs to the Special Issue Advancements in AI-Based Remote Sensing Object Detection)

Abstract

In applications such as aerial object interception and ballistic estimation, it is crucial to precisely detect the centroid position of the target rather than to merely identify the position of the target bounding box or segment all pixels belonging to the target. Due to the typically long distances between targets and imaging devices in such scenarios, targets often exhibit a low contrast and appear as dim, obscure shapes in infrared images, which represents a challenge for human observation. To rapidly and accurately detect small targets, this paper proposes a lightweight, end-to-end detection network for small infrared targets. Unlike existing methods, the input of this network is five consecutive images after background matching. This design significantly improves the network’s ability to extract target motion features and effectively reduces the interference of static backgrounds. The network mainly consists of a local feature aggregation module (LFAM), which uses multiple-sized convolution kernels to capture multi-scale features in parallel and integrates multiple spatial attention mechanisms to achieve accurate feature fusion and effective background suppression, thereby enhancing the ability to detect small targets. To improve the accuracy of predicted target centroids, a centroid correction algorithm is designed. In summary, this paper presents a lightweight centroid detection network based on background matching for weak, small infrared targets. The experimental results show that, compared to directly inputting a sequence of images into the neural network, inputting a sequence of images processed by background matching can increase the detection rate by 9.88%. Using the centroid correction algorithm proposed in this paper can further improve the centroid localization accuracy by 0.0134.

1. Introduction

In the fields of aerospace [1,2] and warning detection [3,4], precise target detection and ballistic estimation are critical technological requirements. Infrared small target centroid detection technology is particularly important due to its ability to accurately determine the centroid position of targets. Infrared imagery, unlike visible light images, is unaffected by day and night conditions, enabling effective target detection under various lighting conditions. Traditional methods for infrared small target detection typically focus on acquiring target bounding boxes or segmenting pixels, which do not directly provide the precise coordinates of the target centroid.
This paper aims to explore infrared small target centroid detection technology, applicable to targets such as missiles, rockets, and hypersonic aircraft, through the analysis and processing of infrared images to precisely predict target centroid positions. This research fills the gaps in existing infrared small target detection techniques regarding centroid localization, thereby offering crucial technological support for achieving more accurate target tracking and interception. In this section, we will discuss the challenges faced in the field of infrared small target detection, the existing state-of-the-art methods, and our contributions.

1.1. Challenges

Firstly, the background of infrared remote sensing images is typically complex. As shown in Figure 1, there are often many interference sources in the image that are similar in shape to the target, which can lead to the algorithm generating a large number of false alarms.
Secondly, the shapes of targets in infrared remote sensing images are sometimes irregular. Therefore, it is not possible to accurately calculate the centroid of the predicted target based on simple mathematical formulas using either the predicted target bounding box or segmented target pixels, thus leading to low accuracy in centroid localization.
Lastly, there is a high demand for real-time target detection in the field of infrared warning detection. Algorithms with high real-time performance often have lower detection rates, while algorithms with higher detection rates often have a lower real-time performance. Achieving high levels of both real-time performance and detection rate simultaneously is challenging.

1.2. Existing State-of-the-Art Methods

To date, infrared small target detection methods exhibit diverse approaches, primarily categorized into traditional image processing-based methods and deep learning-based methods. Traditional image processing-based approaches encompass filtering methods [5,6,7], low-rank sparse recovery methods [8,9,10], and methods based on the human visual system [11,12,13]. These methods typically detect targets only in relatively simple backgrounds and often achieve lower detection accuracy. In complex backgrounds, especially under high brightness conditions, these methods struggle to effectively differentiate between targets and background. Moreover, their usage is relatively complex, requiring adjustments of multiple parameters for different scenarios, thus limiting their general applicability. With the advancement of deep learning, traditional image processing-based methods are gradually being supplanted by deep learning-based methods.
Deep learning-based methods are predominantly divided into convolutional neural network (CNN)-based [14,15,16,17,18] and transformer-based methods [19,20,21,22,23]. These methods leverage extensive data to train neural networks, thereby enhancing the accuracy and robustness of target detection. The advantage of deep learning lies in its ability to optimize neural networks using large-scale data, enabling adaptation to various application scenarios. These methods find widespread application in the field of target detection, continually yielding numerous classic technical solutions. Dai et al. [24] integrated model- and data-driven methods into neural networks to address the issue of limited learnable features for weak, small infrared targets. Yuan et al. [25] proposed SCTransNet, which utilizes spatial-channel cross transformer blocks to establish correlations between encoder and decoder features, predicting contextual differences between targets and backgrounds in deeper network layers. Kou et al. [26] employed semantic segmentation to detect small targets and utilized a method of adjusting the target size threshold to remove false alarms. Li et al. [27] utilized the dense nested attention network to enhance the intersection over union (IOU) of small target segmentation results, but this algorithm performs poorly in false alarm removal. Chen et al. [28] utilized semantic segmentation tasks that focus more on pixel-level features to assist in the training process of small target detection, thereby improving the average detection accuracy and enabling the detection of some previously undetectable targets.

1.3. Our Contributions

In order to accurately detect small targets in infrared images cluttered with interference, we provide background-matched sequence images to the neural network, a method which enhances the neural network’s ability to extract the motion features of the targets. The experimental results demonstrate that, compared to feeding single-frame images into the neural network, feeding unmatched sequence images into the neural network increases the detection rate by 8.42%. Furthermore, compared to feeding unmatched sequence images into the neural network, feeding matched sequence images into the neural network increases the detection rate by 9.88%.
To ensure the centroid positions output by the neural network are closer to the true centroid positions of the targets, we designed a weighted grayscale centroid method (WCM) that calculates more accurate centroid positions based on the image patches around the centroid positions output by the neural network. Additionally, this method can be appended after all bounding box detection networks or semantic segmentation networks for centroid calculation. Experimental results indicate that, after applying this method, the centroid localization accuracy can be improved by 0.0134.
We devised a simple network architecture and a local feature aggregation module (LFAM), which innovatively integrates the network structure of Inception V3 with a spatial attention module. The introduction of the spatial attention module enables the network to predict the centroid positions of weak, small infrared targets more accurately. The experimental results indicate that, compared to conventional fully convolutional neural networks, our designed LFAM enhances the detection rate by 2.42% and improves centroid localization accuracy by 0.0160.

2. Materials and Methods

This section will elaborate on the complete workflow and core details of the proposed algorithm (as shown in Figure 2). Section 2.1 and Section 2.2, respectively, introduce the processes of data preprocessing and label preprocessing, providing standardized inputs for subsequent network training. Section 2.3 outlines the structure of our proposed centroid detection network along with its key component, LFAM. Finally, Section 2.4 describes the centroid calculation method, including the method for converting the centroid map output by the neural network into coordinates and the centroid correction algorithm.

2.1. Data and Label Preprocessing

2.1.1. Data Preprocessing

The dataset used in this study is a publicly available dataset provided by Hui et al. [29], and its data characteristics are shown in Table 1. The targets in the images are small aircraft, but sometimes, due to the targets being too far from the imaging device, they appear only as bright points in the images. It is almost impossible for the human eye to detect the targets in the images with just a single frame. Therefore, it is necessary to combine the historical information of sequential images to detect the targets. Background matching is conducted prior to inputting the sequential images into the neural network to ensure that the neural network can effectively extract the motion features of the targets.
In a long sequence of images, the background gradually shifts as the target moves. Due to the considerable distance between the target and the imaging device and the minimal angle of movement of the device, images before and after device movement can be considered as captured from the same perspective. We employ a sliding window approach with a window size of 5 and a stride of 1 to extract consecutive sets of 5 images from the image sequence. In each window, the last image serves as a reference image, which is matched against the preceding four images. The objective is to identify background patches with similar grayscale values between these two images. Because the movement angle of the device is very small each time, a small image patch from the center of the reference image can always be found at some position among the preceding four images. This transforms the background matching problem into an image patch matching problem. After identifying similar image patches between two images, we determine the movement direction and distance of the device in two-dimensional space. Once the movement direction and distance are determined, the backgrounds of the two images can be aligned. The steps of the background matching process are outlined as follows.
Within the central region of this reference image, we define a dashed-box region S (as shown in Figure 3a). Next, we slide an 80 × 80 window with a stride of 2 along both the x and y directions within this dashed box to search for the pixel area with the highest complexity c. The region with the highest complexity is identified as the matching region $M_1$, delineated by a solid box in Figure 3a. The complexity c is calculated with the following equation:
$$ c = \frac{\sigma^2(I_{80 \times 80})}{\mu(I_{80 \times 80})}, \quad I_{80 \times 80} \subset S \tag{1} $$
Here, $\sigma^2$ represents the variance of the image block, and $\mu$ represents its mean. In Figure 3b, we select an area at the same position as the solid-line box in the reference image and expand it outward by several pixels to form the search range (as indicated by the red solid-line box in Figure 3b). Within this area, we similarly use an 80 × 80 sliding window to search for the region $M_2$ (indicated by the green solid-line box in Figure 3b) that is most similar to the matching region $M_1$ in the reference image. The similarity, s, is calculated as follows:
$$ s = \frac{1}{\left| M_1^{(5)} - M_2^{(5)} \right|} \tag{2} $$
Here, $M_1^{(5)}$ indicates that every 5th pixel in both the x and y directions is selected from the matching region $M_1$, and $M_2^{(5)}$ indicates that every 5th pixel in both the x and y directions is selected from the region $M_2$. The benefit of sampling every 5th pixel is to reduce computational workload. After finding the region $M_2$ with the highest similarity s, we calculate the displacement in the x and y directions between $M_2$ and $M_1$. Based on this displacement, we determine the portion of the image that needs to be retained (as indicated by the red solid line box in Figure 3c). Subsequently, the portion needing to be retained from Figure 3b is “copied” to the bottom right corner of the reference image in Figure 3a, resulting in the final image as shown in Figure 3d.
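For illustration, the following Python sketch implements this matching procedure under the parameters stated above (an 80 × 80 patch, a stride of 2, and sampling every 5th pixel for Equation (2)). The extent of the central region S, the search margin, and all function names are illustrative assumptions rather than a description of the exact implementation.

```python
import numpy as np

def patch_complexity(patch):
    # Complexity c = variance / mean of the patch (Equation (1)).
    return patch.var() / (patch.mean() + 1e-8)

def find_matching_region(reference, window=80, stride=2):
    """Locate the most complex window x window patch M1 inside the central
    region S of the reference frame; returns its top-left corner (y, x)."""
    h, w = reference.shape
    y_lo, y_hi = h // 4, 3 * h // 4 - window  # assumed extent of the central region S
    x_lo, x_hi = w // 4, 3 * w // 4 - window
    best_c, best_pos = -np.inf, (y_lo, x_lo)
    for y in range(y_lo, max(y_hi, y_lo + 1), stride):
        for x in range(x_lo, max(x_hi, x_lo + 1), stride):
            c = patch_complexity(reference[y:y + window, x:x + window].astype(np.float64))
            if c > best_c:
                best_c, best_pos = c, (y, x)
    return best_pos

def background_offset(reference, previous, window=80, stride=2, margin=10, step=5):
    """Estimate the 2-D background shift of `previous` relative to `reference` by
    searching around M1 for the patch M2 that maximizes the similarity of
    Equation (2), computed on every 5th pixel of both patches."""
    ry, rx = find_matching_region(reference, window, stride)
    m1 = reference[ry:ry + window:step, rx:rx + window:step].astype(np.float64)
    best_s, best_off = -np.inf, (0, 0)
    for dy in range(-margin, margin + 1, stride):
        for dx in range(-margin, margin + 1, stride):
            y, x = ry + dy, rx + dx
            if y < 0 or x < 0 or y + window > previous.shape[0] or x + window > previous.shape[1]:
                continue
            m2 = previous[y:y + window:step, x:x + window:step].astype(np.float64)
            s = 1.0 / (np.abs(m1 - m2).sum() + 1e-8)
            if s > best_s:
                best_s, best_off = s, (dy, dx)
    return best_off  # (dy, dx) displacement of the previous frame
```

The returned displacement is then used to crop and shift each of the four preceding frames so that their backgrounds align with the reference frame, as illustrated in Figure 3c,d.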

2.1.2. Label Preprocessing

In the dataset used in this paper, the labels are formatted as the coordinates of the centroids of the targets in each image. We must employ a simplified form of the two-dimensional Gaussian formula, as presented in Equation (3), to calculate the centroid map corresponding to the target centroids. Specifically, the centroid map has the same dimensions as the input image. For each position ( x , y ) in the centroid map, its value Val is calculated using the following equation:
$$ Val = e^{-\frac{(x_0 - x)^2 + (y_0 - y)^2}{2\sigma^2}} \tag{3} $$
Here, x 0 and y 0 represent the horizontal and vertical coordinates of the target centroid. The hyperparameter σ 2 determines the spread of the Gaussian formula, affecting the range of pixels with significant intensities in the centroid map. As shown in Figure 4, when σ 2 is small, the high-energy region in the centroid map will be more closely concentrated around the target centroid, whereas when σ 2 is large, the high-energy region will cover a wider range of pixels.
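As a concrete illustration, the following Python sketch renders such a centroid map from labeled centroid coordinates using Equation (3); the choice σ² = 2 follows Table 2, while the function name and the per-pixel maximum used to combine multiple targets are our own assumptions.

```python
import numpy as np

def render_centroid_map(centroids, height, width, sigma2=2.0):
    """Build a centroid map the same size as the input image: each pixel stores
    exp(-((x0 - x)^2 + (y0 - y)^2) / (2 * sigma2)) for the target centroid (x0, y0)
    (Equation (3)); multiple targets are merged with a per-pixel maximum."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    heat = np.zeros((height, width), dtype=np.float64)
    for x0, y0 in centroids:
        g = np.exp(-((x0 - xs) ** 2 + (y0 - ys) ** 2) / (2.0 * sigma2))
        heat = np.maximum(heat, g)
    return heat

# Example: a single target whose centroid lies between pixel centers.
label_map = render_centroid_map([(127.5, 63.5)], height=256, width=256)
```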
We cannot directly use these coordinates as labels for training the neural network for the following reasons:
  • Directly predicting a centroid coordinate based on the input images is a regression task, which is more complex than predicting the centroid map. When a neural network is trained to predict the coordinate values of targets, it is essentially learning how to map the visual features within images onto these numerical coordinates. However, the neural network might merely treat the output coordinate values as numerical entities with magnitudes, without comprehending the positional information they represent. This significantly undermines the model’s generalization ability, resulting in the network only achieving good regression results on the training set but not on the test set, thereby leading to overfitting. This is precisely why algorithms like YOLOv5 do not directly output the coordinate values of targets but instead produce feature maps of various sizes.
  • Directly predicting centroid coordinates may result in error accumulation. Since the network needs to predict accurate coordinates immediately, any small prediction error may lead to a significant deviation in the final result. Predicting the centroid map can gradually reduce errors through pixel-level predictions, as the predictions for each pixel are relatively independent.
  • The centroid map provides detailed spatial distribution information about the target centroid position through pixel intensities. The intensity of each pixel indicates the proximity of that pixel to the target centroid, and this spatial distribution information is highly useful for the network as it helps the network understand the position of the target in the image. In contrast, directly outputting a single centroid coordinate would lead to the loss of this spatial distribution information.
In summary, although directly outputting a single centroid coordinate may be a concise approach in certain cases, in centroid detection networks, having the neural network output a centroid map is often a better choice. This method can provide more spatial distribution information, simplify the regression task, reduce error accumulation, and offer better interpretability.
In centroid detection tasks, we expect the predicted centroid map to be consistent with the true centroid map in terms of distribution. However, in actual predictions, there may be a certain offset between the predicted centroid map and the true centroid map in the x and y directions. When σ 2 is small, the high-energy region in the centroid map is relatively concentrated, which indicates that pixels predicted to be in close proximity to the true centroid will exert a considerable impact on the loss. Once the centroid offset exceeds a small range, the loss value may remain unchanged, potentially resulting in the slow convergence of the network. Conversely, when σ 2 is large, the high-energy region in the centroid map spreads over a wider range. Additionally, the values in the middle region of the centroid map are similar, leading to minimal changes in loss when the final centroid prediction is offset by a few pixels, which is not conducive to the neural network predicting the optimal solution. Additionally, we conducted experiments to demonstrate the influence of different values of σ 2 on the centroid detection rate and centroid localization accuracy. We provide detailed descriptions of the evaluation metrics for centroid detection rate (Dr) and centroid localization accuracy (mDis) in Section 3.1.2. As shown in Table 2, the centroid detection rate and centroid localization accuracy are highest when σ 2 is set to 2. Whether from the perspective of theoretical analysis or experimental verification, we find that the value of σ 2 is best when it is set to 2. We use the same σ 2 for all training and testing data. This is because the value of σ 2 only affects the range of high-energy regions in the centroid map. The value of σ 2 is independent of the size of the target. Once the value of σ 2 is determined, the proximity of each position in the centroid map to the target centroid can be reasonably represented.

2.2. Network Architecture

The centroid detection network for weak, small infrared targets proposed in this paper is essentially an innovative transformation of traditional semantic segmentation networks. As shown in Figure 5, the entire network structure inherits the classical encoder–decoder architecture. In this network, both in the encoding and decoding stages, we innovatively apply LFAM to extract motion features of the targets from sequence images. The number of channels for the input image is 5. During the encoding stage, through a feature extraction process that deepens layer by layer, the number of channels in the feature map gradually increases from the initial 5 to 8, 16, 32, and finally 64 channels. This progressive strategy can deeply explore the semantic hierarchy of images and construct a rich feature pyramid. Each additional layer of channels enables the model to accurately capture and distinguish different semantic regions in the image. Constructing a rich feature pyramid from low-level details to high-level semantics through multi-level feature extraction lays a solid foundation for subsequent centroid detection tasks. In the decoding stage, the number of channels in the feature map gradually decreases from 64 to 32, 16, and ultimately to a level similar to the number of input image channels. This operation can accurately restore the spatial details of the image and promote effective fusion and presentation of information. By gradually reducing the number of channels, the model focuses more on converting high-level semantic information into specific pixel-level predictions during the decoding process. At the same time, the model is constantly optimizing and integrating multi-level feature information to ensure that the final output preserves the overall structure of the image while accurately reflecting the proximity of each pixel to the target centroid.
During the image encoding process, we deliberately implement three downsampling operations because the targets we focus on are generally small in size, and through three downsampling operations, we can more effectively capture high-level semantic information in the images. For centroid detection tasks, the position of the target centroid is highly likely to be where the pixel with the highest grayscale value in the target is located. Therefore, the moderate downsampling operations used in this study do not result in the loss of useful information in the feature maps. In the decoding stage, we concatenate the feature maps obtained from the encoding stage with those generated in the decoding stage along the channel dimension, and then input them into LFAM for further processing. The encoding stage’s feature maps contain more detailed target information, while those from the decoding stage contain more semantic information about the targets. By merging these two sets of features and processing them through LFAM, we can extract centroid information from both macroscopic and microscopic perspectives. Finally, we concatenate all feature maps generated from the encoding and decoding stages along the channel dimension. Due to variations in the sizes of feature maps from the encoding and decoding stages, we perform upsampling with different scaling factors before concatenation. This resizing ensures all feature maps match the size of the input image. Subsequently, the concatenated feature maps are inputted into a 1 × 1 convolutional layer to generate a single-channel feature map. This feature map represents the neural network’s final output, which we refer to as the centroid map, as it encapsulates centroid information of the targets. With this design, our network can accurately locate the centroid position of weak, small infrared targets.
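A compact PyTorch sketch of this encoder-decoder is given below: channels progress 5 → 8 → 16 → 32 → 64 with three downsamplings, skip connections concatenate encoder and decoder features, and all intermediate maps are upsampled and fused by a final 1 × 1 convolution into a single-channel centroid map. The LFAM block is stood in for by a plain convolutional block here (see Section 2.3), and the final sigmoid activation, the normalization layers, and all names are assumptions of this sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Placeholder for LFAM: 3 x 3 convolution + BatchNorm + ReLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class CentroidNet(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [5, 8, 16, 32, 64]                       # 5-frame input, encoder widths
        self.enc = nn.ModuleList(ConvBlock(chans[i], chans[i + 1]) for i in range(4))
        self.pool = nn.MaxPool2d(2)
        self.dec3 = ConvBlock(64 + 32, 32)               # decoder stages with skip connections
        self.dec2 = ConvBlock(32 + 16, 16)
        self.dec1 = ConvBlock(16 + 8, 8)
        # All encoder/decoder maps are upsampled to full resolution and fused by a 1 x 1 conv.
        self.head = nn.Conv2d(8 + 16 + 32 + 64 + 32 + 16 + 8, 1, kernel_size=1)

    def forward(self, x):                                # x: (B, 5, H, W)
        e1 = self.enc[0](x)                              # (B, 8, H, W)
        e2 = self.enc[1](self.pool(e1))                  # (B, 16, H/2, W/2)
        e3 = self.enc[2](self.pool(e2))                  # (B, 32, H/4, W/4)
        e4 = self.enc[3](self.pool(e3))                  # (B, 64, H/8, W/8)
        up = lambda t, ref: F.interpolate(t, size=ref.shape[-2:], mode='bilinear', align_corners=False)
        d3 = self.dec3(torch.cat([up(e4, e3), e3], dim=1))
        d2 = self.dec2(torch.cat([up(d3, e2), e2], dim=1))
        d1 = self.dec1(torch.cat([up(d2, e1), e1], dim=1))
        fused = torch.cat([up(f, e1) for f in (e1, e2, e3, e4, d3, d2, d1)], dim=1)
        return torch.sigmoid(self.head(fused))           # single-channel centroid map

# model = CentroidNet(); out = model(torch.rand(1, 5, 256, 256))  # out: (1, 1, 256, 256)
```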

2.3. Local Feature Aggregation Module

For the problem of extracting motion features of the target from sequence images, we innovatively combine the network structure of Inception V3 [30] with spatial attention modules to design LFAM, as shown in Figure 6. The main feature extraction modules in LFAM use 1 × 1 convolution, 3 × 3 convolution, and 3 × 3 dilation convolution. This approach aims to capture multi-level features of small targets in images and reduce model complexity while ensuring detection rate. The remaining convolution operations in LFAM are all 1 × 1 convolutions, and their main purpose is to change the number of channels in the output result.
In LFAM, we first perform global average pooling and global max pooling on $F_1$ to initially obtain spatial attention information. Global average pooling constructs a new feature map by averaging the elements within each channel, thereby extracting global information represented by the average features of each channel. This approach is pivotal for comprehending the overall content of an image. Conversely, global max pooling generates a feature map by selecting the maximum value from each channel, emphasizing salient features that are typically the model’s primary focus. Subsequently, this spatial attention information is inputted into a 1 × 1 convolutional layer to adjust the number of channels to 3. Meanwhile, $F_1$ also undergoes convolution with a kernel size of 3 × 3 to obtain $F_2 \in \mathbb{R}^{h \times w \times c_2}$, convolution with a kernel size of 3 × 3 and dilation rate of 2 to obtain $F_3 \in \mathbb{R}^{h \times w \times c_2}$, and convolution with a kernel size of 1 × 1 to obtain $F_4 \in \mathbb{R}^{h \times w \times c_2}$. These operations aim to capture features of the image from different scales. The three feature maps are then multiplied by the three feature maps generated by the spatial attention mechanism to further highlight the spatial features of the target. Subsequently, $F_2$, $F_3$, and $F_4$ are concatenated along the channel dimension and inputted into a 1 × 1 convolutional layer to change the number of output feature maps. The purpose of this step is to integrate features extracted by the convolutional neural network at different scales, producing feature maps that encompass information from various scales and possess higher-level semantic content. Finally, the output of this step is added to the feature map of $F_1$ processed by a 1 × 1 convolutional layer to generate the final output feature map, $F_5 \in \mathbb{R}^{h \times w \times c_2}$. Specifically, the equations for the calculation of the feature map $F_5$ are as follows:
$$ Sa_3,\ Sa_4,\ Sa_5 = f_{1 \times 1}\left(\left[\, avgpool(F_1);\ maxpool(F_1) \,\right]\right) \tag{4} $$
$$ F_5 = f_{1 \times 1}\left(\left[\, f_{3 \times 3}(F_1) \times Sa_3;\ f_{3 \times 3, d=2}(F_1) \times Sa_4;\ f_{1 \times 1}(F_1) \times Sa_5 \,\right]\right) + f_{1 \times 1}(F_1) \tag{5} $$
Here, $f_{1 \times 1}$ denotes the convolution with a 1 × 1 kernel, $f_{3 \times 3}$ represents convolution with a 3 × 3 kernel, $f_{3 \times 3, d=2}$ represents convolution with a 3 × 3 kernel and dilation rate of 2, $avgpool$ denotes global average pooling, $maxpool$ denotes global max pooling, and $[\,;\,]$ denotes the concatenation of feature maps along the channel dimension. With this unique design, LFAM not only inherits the advantages of Inception V3 in multi-scale feature extraction but also incorporates the essence of spatial attention modules, significantly enhancing the model’s ability to handle spatial information. In the following experiments, we also demonstrated the superiority of the LFAM.
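The following PyTorch sketch follows Equations (4) and (5). Because a spatial attention map must retain the h × w resolution, the pooling is interpreted here as channel-wise average and max pooling, as in standard spatial attention modules; the sigmoid applied to the attention maps and the channel bookkeeping are likewise assumptions of this sketch.

```python
import torch
import torch.nn as nn

class LFAM(nn.Module):
    """Local feature aggregation module: three parallel branches (3x3, dilated 3x3, 1x1)
    modulated by three spatial attention maps derived from pooled features (Eqs. (4)-(5))."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.att = nn.Conv2d(2, 3, kernel_size=1)                     # produces Sa3, Sa4, Sa5
        self.b3 = nn.Conv2d(c_in, c_out, 3, padding=1)                # f_3x3
        self.b3d = nn.Conv2d(c_in, c_out, 3, padding=2, dilation=2)   # f_3x3, d=2
        self.b1 = nn.Conv2d(c_in, c_out, 1)                           # f_1x1
        self.fuse = nn.Conv2d(3 * c_out, c_out, 1)                    # merges the three branches
        self.skip = nn.Conv2d(c_in, c_out, 1)                         # residual path for F1

    def forward(self, f1):
        # Spatial attention from channel-wise average and max pooling (Eq. (4)).
        pooled = torch.cat([f1.mean(dim=1, keepdim=True),
                            f1.max(dim=1, keepdim=True).values], dim=1)
        sa3, sa4, sa5 = torch.sigmoid(self.att(pooled)).chunk(3, dim=1)
        # Multi-scale branches weighted by their attention maps, then fused (Eq. (5)).
        f2 = self.b3(f1) * sa3
        f3 = self.b3d(f1) * sa4
        f4 = self.b1(f1) * sa5
        fused = self.fuse(torch.cat([f2, f3, f4], dim=1))
        return fused + self.skip(f1)

# lfam = LFAM(16, 32); y = lfam(torch.rand(1, 16, 64, 64))  # y: (1, 32, 64, 64)
```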

2.4. Post-Processing for Centroid Computation

After the neural network predicts the centroid map, we perform a reverse operation to calculate the centroids of the targets from the centroid map. The specific steps are as follows:
  • Firstly, we determine the number of high-energy regions (depicted as red regions in Figure 4) in each centroid map, denoted as n. The value of n also represents the number of targets in the centroid map. When the centroid coordinates have the form (x + 0.5, y + 0.5), i.e., the centroid falls exactly between pixel centers, the distance between the centroid and the nearest pixel center reaches its maximum value. At this point, the maximum value within the high-energy area is minimized; according to Equation (3), this maximum value can be calculated to be approximately 0.8. Therefore, we consider two peaks in high-energy areas to represent two targets when both maximum values exceed 0.8 and their positions are separated by more than 16 pixels.
  • For each high-energy region in the centroid map, we locate the position ( x 1 , y 1 ) of the point P 1 with the highest value within this region. Let V a l ( P 1 ) denote its value. If the predicted centroid map is accurate enough, the value of each position in the centroid map is calculated by Equation (3). Then, the value of P 1 is calculated using the following equation:
    $$ Val(P_1) = e^{-\frac{(x_0 - x_1)^2 + (y_0 - y_1)^2}{4}} \tag{6} $$
  • Within the 4-neighborhood N4( P 1 ) of P 1 , we find the positions ( x 2 , y 2 ) and ( x 3 , y 3 ) of points P 2 and P 3 in the vertical direction, respectively, with their values denoted as V a l ( P 2 ) and V a l ( P 3 ) . Similarly, the values of P 2 and P 3 are calculated using the following equations:
    $$ Val(P_2) = e^{-\frac{(x_0 - x_2)^2 + (y_0 - y_2)^2}{4}} \tag{7} $$
    $$ Val(P_3) = e^{-\frac{(x_0 - x_3)^2 + (y_0 - y_3)^2}{4}} \tag{8} $$
  • Since P 1 , P 2 , and P 3 are in the same column in the image, x 1 = x 2 = x 3 . According to Equations (6) and (7), we can calculate the centroid’s vertical coordinate y 01 . According to Equations (6) and (8), we can calculate the centroid’s vertical coordinate y 02 . We take their average as the final centroid’s vertical coordinate y 0 .
  • Within the 4-neighborhood N4( P 1 ) of P 1 , we find the positions ( x 4 , y 4 ) and ( x 5 , y 5 ) of points P 4 and P 5 in the horizontal direction, respectively, with their values denoted as V a l ( P 4 ) and V a l ( P 5 ) . Similarly, the values of P 4 and P 5 are calculated using the following equations:
    $$ Val(P_4) = e^{-\frac{(x_0 - x_4)^2 + (y_0 - y_4)^2}{4}} \tag{9} $$
    $$ Val(P_5) = e^{-\frac{(x_0 - x_5)^2 + (y_0 - y_5)^2}{4}} \tag{10} $$
  • Since P 1 , P 4 , and P 5 are in the same row in the image, y 1 = y 4 = y 5 . According to Equations (6) and (9), we can calculate the centroid’s horizontal coordinate x 01 . According to Equations (6) and (10), we can calculate the centroid’s horizontal coordinate x 02 . We take their average as the final centroid’s horizontal coordinate x 0 .
  • Finally, we repeat steps 2–6 until all n high-energy regions in the centroid map are traversed.
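Steps 2–6 above amount to inverting Equation (3) with σ² = 2: for a 4-neighbour at offset d = ±1 along one axis and true sub-pixel offset Δ, ln(Val(P_n)/Val(P_1)) = (2dΔ − 1)/4, so Δ = (4 ln(Val(P_n)/Val(P_1)) + 1)/(2d). A minimal Python sketch of this recovery, assuming the predicted map follows Equation (3) exactly and the peak is not on the image border, is given below; the clamping constant and function name are our own.

```python
import numpy as np

def subpixel_centroid(heat, y1, x1, eps=1e-8):
    """Recover the sub-pixel centroid (x0, y0) around the peak (x1, y1) of a predicted
    centroid map, assuming Val(P) = exp(-((x0 - xp)^2 + (y0 - yp)^2) / 4) (Eq. (3), sigma^2 = 2).
    For each 4-neighbour at offset d (+1 or -1) along one axis:
        offset = (4 * ln(Val(P_n) / Val(P_1)) + 1) / (2 * d);
    the two estimates per axis are averaged, as in steps 4 and 6."""
    v1 = max(float(heat[y1, x1]), eps)

    def axis_offset(samples):
        return float(np.mean([(4.0 * np.log(max(float(v), eps) / v1) + 1.0) / (2.0 * d)
                              for v, d in samples]))

    dy = axis_offset([(heat[y1 - 1, x1], -1.0), (heat[y1 + 1, x1], +1.0)])
    dx = axis_offset([(heat[y1, x1 - 1], -1.0), (heat[y1, x1 + 1], +1.0)])
    return x1 + dx, y1 + dy
```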
After calculating the target centroids, we designed a centroid correction algorithm to make the predicted centroids more accurate. This algorithm corrects the centroid position based on a weighted sum of pixels around the target centroid, so we refer to it as the weighted grayscale centroid method (WCM). By considering the calculation equation in the conventional grayscale centroid method (GCM), it can be observed that this algorithm assumes that pixels farther from the center of the target image block contribute more to the calculation of the target centroid; however, centroids calculated using this method are inaccurate. In contrast, our proposed WCM considers that pixels closer to the center of the target image block contribute more to the calculation of the target centroid.
The solution method for the GCM is as follows: for an image block B, establish a Cartesian coordinate system with its center as the origin. Then, use the following equations:
$$ m_{00} = \sum_{(x, y) \in B} I(x, y) \tag{11} $$
$$ m_{01} = \sum_{(x, y) \in B} x \times I(x, y) \tag{12} $$
$$ m_{02} = \sum_{(x, y) \in B} y \times I(x, y) \tag{13} $$
$$ C = \left( \frac{m_{01}}{m_{00}},\ \frac{m_{02}}{m_{00}} \right) \tag{14} $$
Here, I x , y represents the pixel value at position ( x , y ), and C represents the coordinates of the centroid obtained in the Cartesian coordinate system established with the image block’s center. Based on the design principles of the GCM algorithm, we optimized and developed the centroid correction algorithm, WCM. According to Figure 7c,d, it can be observed that, when calculating the final centroid position, pixels at different positions within the image block exert varying degrees of influence on the final centroid position. Pixels closer to the centroid predicted by the neural network have a greater impact on the final centroid position. The solution method for our proposed WCM is as follows: for an image block B, establish a Cartesian coordinate system with its center as the origin. Then, use the following equations:
$$ m_{001} = \sum_{(x, y) \in B} \frac{1}{\left| x + \frac{x}{|x|} y^2 \right|} \times I(x, y), \quad x \neq 0 \tag{15} $$
$$ m_{002} = \sum_{(x, y) \in B} \frac{1}{\left| y + \frac{y}{|y|} x^2 \right|} \times I(x, y), \quad y \neq 0 \tag{16} $$
$$ m_{01} = \sum_{(x, y) \in B} \frac{1}{x + \frac{x}{|x|} y^2} \times I(x, y), \quad x \neq 0 \tag{17} $$
$$ m_{02} = \sum_{(x, y) \in B} \frac{1}{y + \frac{y}{|y|} x^2} \times I(x, y), \quad y \neq 0 \tag{18} $$
$$ C = \left( \frac{2 m_{01}}{m_{001}},\ \frac{2 m_{02}}{m_{002}} \right) \tag{19} $$
Here, I x , y represents the pixel value at position ( x , y ), and C represents the centroid coordinates obtained in the Cartesian coordinate system established with the image block’s center. From Equations (15) and (17), and Figure 7c, it can be observed that pixels farther from the center of the image block have a smaller influence on the calculated centroid offset. Moreover, when calculating the offset in the x-direction, the weight assigned to the pixel at (0, −4) is much smaller than the weight assigned to the pixel at (−4, 0).
We refer to the parts where the m 01 and m 02 of the two algorithms are multiplied by I x , y as the offset weight in the x and y directions, respectively. The offset weight maps of the two algorithms are shown in Figure 7. From the figure, it can be observed that the GCM considers that the farther the pixel is from the target centroid, the greater its effect when calculating the centroid offset. Moreover, when calculating the offset in the x direction, all pixels with the same x value exert the same influence on the centroid offset calculation, which is obviously unreasonable. In contrast, our proposed algorithm assigns a greater influence to pixels closer to the target centroid when calculating the offset in the x direction. The computation method of GCM is similar to clustering. A point that is far from the center of an image block or cluster will have a significant impact on the final centroid or cluster center; however, our method differs in this aspect. When computing the centroid of the target, the offset weight of a point decreases as its distance from the center of the image block increases. Consequently, the influence of pixels farther from the center is minimized when calculating the final centroid. If the target shape resembles a tadpole, our algorithm first roughly detects the centroid at the head of the tadpole. Using the GCM to calculate the centroid in such cases would be greatly disturbed by the tail of the tadpole. However, in our algorithm, when calculating the centroid of such targets, the weight of pixels farther from the center of the image block is lower, effectively mitigating this interference. This design is considered reasonable and has achieved good results in previous experiments.
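For reference, a small Python sketch of the conventional GCM of Equations (11)–(14) is given below, applied to a patch cut around the coarse centroid predicted by the network; the patch half-width, the rounding of the coarse centroid, and the assumption that the patch lies fully inside the image are choices of this sketch. WCM differs only in that each pixel's contribution to the offset is additionally weighted by a factor that decays with its distance from the patch centre (Equations (15)–(19)), so outlying pixels such as a tadpole-like tail perturb the corrected centroid far less.

```python
import numpy as np

def gcm_centroid(image, cx, cy, half=4):
    """Conventional grayscale centroid method (Equations (11)-(14)): a Cartesian
    coordinate system is placed at the centre of a (2*half+1)^2 patch around the
    coarse centroid (cx, cy); returns the corrected centroid in image coordinates."""
    cy_i, cx_i = int(round(cy)), int(round(cx))
    patch = image[cy_i - half:cy_i + half + 1, cx_i - half:cx_i + half + 1].astype(np.float64)
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    m00 = patch.sum()
    m01 = (xs * patch).sum()
    m02 = (ys * patch).sum()
    return cx_i + m01 / m00, cy_i + m02 / m00
```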

3. Results

In this section, we analyze all of the experimental results to demonstrate the superiority of our proposed method. Section 3.1 presents some experimental details and evaluation metrics for the centroid detection method proposed in this paper. Section 3.2 showcases the comparative results between our proposed method and other state-of-the-art methods. In Section 3.3, we conduct a series of ablation experiments to validate the rationality of our proposed method.

3.1. Implementation Details

3.1.1. Training Settings

In this experiment, we implemented our algorithm on a KYLIN V10 system equipped with an NVIDIA RTX 3090 GPU with 24 GB of memory. We selected 10 out of 14 scenes for training, with a ratio of approximately 4:1 between training and testing data. The input image size for our method is 256 × 256 pixels, and the sequence length of input images is five. Since the labels used during training are centroid maps, where each value represents both the proximity to the target centroid and the probability of being the target centroid at that position, we utilized binary cross-entropy loss to measure the difference between the predicted and ground truth labels. We set the initial learning rate to 0.01 during training, trained for 100 epochs in total, and employed cosine annealing to decrease the final learning rate to 1 × 10−4.
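A minimal PyTorch training configuration matching these settings (binary cross-entropy loss, initial learning rate 0.01, 100 epochs, cosine annealing down to 1 × 10−4) is sketched below. The optimizer type, batch size, and the dummy tensors used to keep the sketch self-contained are assumptions; CentroidNet refers to the network sketched in Section 2.2.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors only to make the sketch runnable; replace with the real matched sequences.
frames = torch.rand(8, 5, 256, 256)      # five background-matched frames per sample
maps = torch.rand(8, 1, 256, 256)        # ground-truth centroid maps (Equation (3))
train_loader = DataLoader(TensorDataset(frames, maps), batch_size=4, shuffle=True)

model = CentroidNet()                    # encoder-decoder sketched in Section 2.2
criterion = nn.BCELoss()                 # centroid-map values lie in [0, 1]
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # optimizer type assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-4)

for epoch in range(100):
    for seq, target_map in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(seq), target_map)
        loss.backward()
        optimizer.step()
    scheduler.step()                     # cosine decay of the learning rate per epoch
```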

3.1.2. Evaluation Metrics

To evaluate the precision of centroid detection, we adopted the mean Euclidean distance (mDis) as a key metric. Additionally, to quantify the accuracy of centroid detection, we introduced the metric detection rate (Dr). Specifically, a target is considered successfully detected only if the Euclidean distance between the predicted centroid and the actual centroid is less than two. The detection rate is calculated as the ratio of the number of successfully detected targets to the total number of targets. Furthermore, to measure the frequency of false alarms in each image, we defined the false alarm rate (Fr). It represents the average number of false alarms per image. The equations for these evaluation metrics are as follows:
$$ mDis = \frac{\sum_{i=1}^{M} L(Pred_i, GT_i),\ \text{if } L(Pred_i, GT_i) \le 2}{\sum_{i=1}^{M} 1,\ \text{if } L(Pred_i, GT_i) \le 2} \tag{20} $$
$$ Dr = \frac{\sum_{i=1}^{M} 1,\ \text{if } L(Pred_i, GT_i) \le 2}{N} \tag{21} $$
$$ Fr = \frac{\sum_{i=1}^{M} 1,\ \text{if } L(Pred_i, GT_i) > 2}{n} \tag{22} $$
Here, $L(\cdot, \cdot)$ denotes the Euclidean distance between two points, $Pred_i$ represents the predicted target centroid, $GT_i$ represents the actual target centroid, n denotes the total number of images in the test set, N denotes the total number of targets in the test set, and M denotes the total number of predicted targets and false alarms. From the three equations, it can be observed that an increase in Dr may lead to an increase in mDis. This is because an increase in Dr indicates that predictions originally more than two pixels from the true centroid (and thus counted as false alarms) are now detected within the two-pixel threshold; their distances from the true centroid, although below two pixels, often exceed one pixel, which is larger than the average localization error, and so mDis increases.
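These three metrics can be computed from per-prediction distances as in the short sketch below; how predicted centroids are associated with ground-truth centroids (e.g., nearest-neighbour matching) is not prescribed here and is left to the caller.

```python
import numpy as np

def evaluate(pred_gt_dists, n_targets, n_images, thresh=2.0):
    """pred_gt_dists: Euclidean distance from each predicted centroid to its associated
    ground-truth centroid, over the whole test set (length M = detections + false alarms).
    n_targets: N, total ground-truth targets; n_images: n, number of test images.
    Returns (mDis, Dr, Fr) following Equations (20)-(22)."""
    d = np.asarray(pred_gt_dists, dtype=np.float64)
    hits = d <= thresh
    m_dis = d[hits].mean() if hits.any() else float('nan')   # Eq. (20)
    dr = hits.sum() / n_targets                              # Eq. (21)
    fr = (~hits).sum() / n_images                            # Eq. (22)
    return m_dis, dr, fr
```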

3.2. Comparison with Other State-of-the-Art Methods

Currently, nearly all related studies are concentrated on bounding box detection or the semantic segmentation of infrared small targets, with centroid detection being relatively rare. Therefore, we compared several mainstream and practical methods for small infrared target bounding box detection and semantic segmentation, including DNAnet [27], YOLOv5, YOLOv7 [31], YOLOv8, YOLOv9 [32], YOLOv10 [33], SSD [34], CenterNet [35], MTU-Net [36], and UCDnet [37]. Apart from UCDnet, which is an algorithm we proposed in a previous paper, the code of the other comparative algorithms is taken from the original source code provided by the respective authors on GitHub. In other words, the source code of every algorithm used in this paper is provided by its respective authors. Each author specifies the dataset format required for their algorithm to function correctly. Rather than modifying the algorithms to accommodate our dataset, we adjust the format of the labels in our dataset to comply with the requirements of each algorithm.
Since other methods cannot directly output the centroid position of targets, for bounding box detection methods, we use the neural network-predicted center of the bounding box as the target centroid. For semantic segmentation methods, we use the geometric center of the neural network-predicted mask as the target centroid. We primarily use evaluation metrics such as mDis, Dr, and Fr to assess the performance of each method in terms of centroid localization accuracy, detection rate, and false alarm rate. As our neural network predicts the centroid position of targets, we cannot use traditional evaluation metrics such as IOU to calculate the overlap between predicted and ground truth centroid positions. Therefore, evaluation metrics commonly used for bounding box detection or semantic segmentation algorithms are not applicable in our experiment. Each method is trained to its optimal state. The experimental results are shown in Table 3.
From Table 3, it can be observed that YOLOv5s performs poorly in terms of detection rate and false alarm rate. However, the latest iteration, YOLOv7, performs well in both aspects. YOLOv8n performs poorly across all evaluation metrics. The primary reason lies in the fact that YOLOv8n, being the smallest model in the YOLOv8 series, is designed to offer faster detection speeds. Consequently, its model complexity is relatively low, with fewer parameters and computations. While this lightweight design enhances speed, it may compromise detection accuracy to some extent, particularly when dealing with small targets. Furthermore, YOLOv8n employs an Anchor-Free design, which simplifies the inference process but may not be as precise as Anchor-Based methods (such as YOLOv5 and YOLOv7) in detecting small targets under certain circumstances. This is because anchor boxes provide the model with prior scale information, aiding it in better predicting the bounding boxes of small targets. The algorithms that exhibit extremely poor performance in Table 3 also include YOLOv9 and YOLOv10s. This indicates that during the design process, these algorithms paid more attention to the detection performance for targets of normal size, thereby neglecting the detection performance for extremely small targets (with only a few pixels). The SSD and CenterNet algorithms exhibit poor centroid localization accuracy, relatively high false alarm rates, and high computational complexity. These methods are not suitable for detection tasks requiring high centroid localization accuracy. In contrast, due to their utilization of the pixels where the centroids are located to generate masks, DNAnet, MTU-Net, and UCDnet can accurately detect the positions of target centroids. However, the drawback of these methods is their low detection rates. Through the above comparative experiments, it can be seen that our proposed method performs well in various evaluation metrics such as centroid localization accuracy, detection rate, false alarm rate, and computational complexity. This is because our algorithm can fully leverage the motion features of small targets extracted from sequential images. Even if the size of the target is very small and the contrast is very low, as long as the position of the target changes or the grayscale value of the target changes, we can accurately detect the target. However, existing infrared small target detection algorithms all use single-frame images to detect small targets. Compared to our proposed method, existing methods lack information about the time dimension, resulting in poor detection rates. From the table, it can also be seen that our algorithm exhibits good real-time performance, achieving a detection speed of 40 frames per second.
Since the evaluation metrics for YOLOv8n, YOLOv9, and YOLOv10s are all zero, we will not present the performance of these three algorithms on each test scene. Instead, we will only compare the performance of the other eight algorithms on each test scene (as shown in Table 4). It was found that these eight methods had low detection rates in Scene 6 and Scene 13, and a high false alarm rate in Scene 11. This is because the targets in Scene 6 and Scene 13 exhibit low contrast, while Scene 11 contains numerous false alarms that resemble the shape of the targets. All the semantic segmentation algorithms perform poorly on Scene 6, with DNAnet and MTU-Net being particularly significant, with a detection rate of zero. This is because the target size of Scene 6 is very small (about two pixels) and has extremely low contrast with the background. Both DNAnet and MTU-Net employ highly intricate, densely connected network architectures. For small-sized and low-contrast targets, after undergoing numerous convolutions, pooling, and other operations, the differences between them and the background tend to smooth out, making it harder for the network to capture the presence of the targets. Nevertheless, this densely connected network structure enables a more accurate detection of small targets with higher contrast. This explains why DNAnet and MTU-Net outperform UCDnet in the remaining test scenes. The performance of these eight methods in the four test scenes is shown in Figure 8.
In addition to comparing our proposed centroid detection method with other state-of-the-art methods, we further conducted a comparative analysis between our proposed centroid correction algorithm and the grayscale centroid method to thoroughly evaluate the performance and effectiveness of our algorithm in centroid correction. As shown in Table 5, compared to the grayscale centroid method, our proposed centroid correction algorithm can effectively improve centroid localization accuracy, especially in cases where the centroid localization accuracy is poor. Additionally, Table 5 demonstrates that using our proposed centroid correction algorithm can improve the detection rate and reduce the false alarm rate, because our algorithm can correct predicted centroids outside a two-pixel radius from the target centroid to within two pixels of the target centroid.
In the above experiments, we considered a distance less than two pixels between the predicted centroid and the actual centroid as successfully detecting the target. To further validate the centroid localization accuracy of various algorithms, we set the thresholds to one, two, and three pixels, respectively, to compare the performance of eight algorithms in terms of centroid localization accuracy, detection rate, and false alarm rate. The closer the line plot is to the bottom right corner in Figure 9, the better the algorithm’s performance. From Figure 9, it can be observed that our algorithm performs better in terms of detection rate and false alarm rate compared to other algorithms, while being slightly lower in centroid localization accuracy than DNAnet, MTU-Net, and UCDnet. This is because these three algorithms use the target centroid as the semantic segmentation mask rather than using the entire target’s pixels as the semantic segmentation mask.

3.3. Ablation Experiments

To validate that our proposed algorithm can effectively capture the motion features of targets, we separately input single-frame images, sequences of five unmatched images, sequences of five matched images, sequences of ten matched images, and sequences of twenty matched images into the neural network. As shown in Table 6, feeding five matched images into the neural network significantly improves the detection rate, centroid localization accuracy, and false alarm rate. To assess the impact of background matching, we input both matched and unmatched images into the same neural network. The results indicate that the neural network finds it easier to extract motion features from matched sequence images compared to unmatched ones. Finally, we investigated the effect of different sequence lengths on the results. As shown in Table 6, increasing the length of sequence images slightly improves the detection rate but significantly reduces centroid localization accuracy. The most crucial aspect is that increasing the length of the image sequence results in a longer duration of the background matching process. Therefore, for target detection tasks requiring high real-time performance, it is necessary to utilize shorter sequences of consecutive images as inputs to ensure the efficient operation of the algorithm.
In addition, we also compared our proposed LFAM with a conventional feature extraction module stacked with standard convolutional layers to demonstrate the feature extraction capability of our proposed module. Compared to the standard convolutional layers, our proposed LFAM incorporates a spatial attention module. Additionally, we employ a multi-branch parallel approach to extract features at different scales. In the ablation experiments, we removed the spatial attention module and replaced the multi-branch parallel feature extraction module with standard convolutions with a kernel size of 3 × 3. As shown in Table 7, LFAM achieved an improvement of 0.016 in centroid localization accuracy, an increase of 2.42% in detection rate, and a decrease of 0.0242 in false alarm rate.

4. Discussion

This paper proposes a centroid detection algorithm for weak, small infrared targets, consisting of three main parts: input data preprocessing, a neural network, and centroid correction. Based on the working principle of convolutional neural networks, it is found that by merging the background-matched sequence images along the channel dimension, we can efficiently extract the motion features between images directly using convolutional neural networks. This method helps us to analyze and understand the complex spatiotemporal information contained in image sequences more thoroughly.
In the input data preprocessing part, it is observed that the longer the length of the input image sequence, the more effectively our proposed network can extract the motion features of the target. This is because it can be intuitively seen that the longer the input image sequence, the more obvious the motion of the target in the images. However, the disadvantage of increasing sequence length is the increase in background matching time, which is not conducive to real-time detection. Therefore, the length of the input image sequence can be set according to the real-time requirements of the application scenario.
In the neural network part, it is found that, due to the emphasis of our proposed network on detecting the centroid position of the target rather than accurately predicting the bounding box or segmenting the target pixels, we can design the network structure to be very simple. Even if the network parameters and computational complexity are kept minimal, this will not affect the final target detection performance. Additionally, we analyzed the reasons why our algorithm performed poorly in Scene 6, even lower than SSD and CenterNet. In Scene 6, the targets are mostly stationary, which makes it difficult for our network to extract the motion features of the targets effectively. Moreover, because our network structure is relatively simple, its performance in extracting the spatial characteristics of the targets may be lower than that of other complex neural networks.
In the centroid correction part, it is found that, in cases where the centroid localization accuracy is poor, our proposed centroid correction algorithm can significantly improve the centroid localization accuracy while also increasing the detection rate. It is worth noting that there is usually a certain trade-off between increasing the detection rate and improving the centroid localization accuracy, because increasing the detection rate means moving the predicted centroids that are originally far from the true centroid closer to it. However, our proposed method successfully breaks through this limitation by not only improving the detection rate but also ensuring an increase in centroid localization accuracy, demonstrating the innovative and excellent performance of our algorithm.

5. Conclusions

This paper demonstrates the advancement of our proposed centroid detection method for weak, small infrared targets through a series of reasonable experiments. From these experiments, it is evident that, upon inputting long sequence images into the neural network, the network can easily extract the motion features of targets in sequence images, provided that the targets are moving in the input sequence images. Additionally, we can use centroid correction algorithms to bring the predicted target centroids closer to the actual centroids. These factors all demonstrate the advancements offered by our proposed algorithm. The more detailed detection results of the method we proposed are shown in Figure A1. However, our proposed algorithm also has certain limitations. When the targets are basically stationary, it is difficult for our algorithm to detect them. We believe that adding a tracking algorithm to our proposed algorithm at this time can effectively overcome this limitation. In the future, we will research algorithms that combine detection of weak, small infrared targets with tracking. Weak, small infrared targets have fewer shape features for neural networks to learn compared to targets in visible-light images. Therefore, we plan to focus on researching background modeling-based tracking algorithms for weak, small infrared targets in the future. We have already addressed the problem of weak, small infrared target detection under low-signal-to-noise-ratio conditions, and in the future, we will also focus on the problem of weak, small infrared target detection under extremely low-signal-to-noise-ratio conditions. We hope this research will lead us to many achievements in this area.

Author Contributions

Methodology, X.X., J.W., Z.S., H.N. and M.Z.; software, X.X., J.W., H.N. and Y.N.; writing—original draft, X.X., Z.S., H.N., M.Z. and Y.N.; writing—review and editing, J.W., H.N. and Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Bureau of Changchun, China, under grant number 2024GD03.

Institutional Review Board Statement

Current research is limited to the early warning for small unmanned aerial vehicles (UAVs), which is beneficial for maintaining public order and does not pose a threat to public health or national security. Authors acknowledge the dual-use potential of the research involving infrared small target centroid detection and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, authors strictly adhere to relevant national and international laws about DURC. Authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this section, we will supplement our experimental results. We selected 15 images from the test set where the targets moved slowly and had dim textures, and we used our method to detect their centroids. Figure A1 illustrates the performance of our method on these images. It can be observed from the figure that our algorithm predicted centroids slightly farther from the true centroids only in the 271st frame, while centroids predicted in other images were less than one pixel away from the true centroids. This also intuitively demonstrates the performance of our method on low-signal-to-noise-ratio images.
Figure A1. Detection results of our method on a subset of images from the test dataset, where red points represent the ground truth centroids and green points represent the centroids predicted by our method. The red box in the figure indicates the location of the target.

References

1. Lin, B.; Yang, X.; Wang, J.; Wang, Y.Y.; Wang, K.P.; Zhang, X.H. A Robust Space Target Detection Algorithm Based on Target Characteristics. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8012405.
2. Wang, J.W.; Li, G.; Zhao, Z.C.; Jiao, J.; Ding, S.; Wang, K.P.; Duan, M.Y. Space Target Anomaly Detection Based on Gaussian Mixture Model and Micro-Doppler Features. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5118411.
3. Fang, H.Z.; Ding, L.; Wang, L.M.; Chang, Y.; Yan, L.X.; Han, J.H. Infrared Small UAV Target Detection Based on Depthwise Separable Residual Dense Network and Multiscale Feature Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5019120.
4. Ma, X.Y.; Li, Y. Edge-Aided Multiscale Context Network for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7001405.
5. Wenchao, Z.; Yanfei, W.; Hexin, C. Moving Point Target Detection in Complex Background Based on Tophat Transform. J. Image Graph. 2007, 12, 871–874.
6. Qingboa, J.I.; Xingzhou, Z.; Xuezhi, X. A Detection Method for Small Targets Based on Wavelet Transform and Data Fusion. J. Proj. Rocket. Missiles Guid. 2008, 28, 234.
7. Li, Y.S.; Li, Z.Z.; Shen, Y.; Li, J. Infrared Small Target Detection Based on 1-D Difference of Guided Filtering. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7000105.
8. Luo, J.; Yu, H. Research of Infrared Dim and Small Target Detection Algorithms Based on Low-Rank and Sparse Decomposition. Laser Optoelectron. Prog. 2023, 60, 1600004.
9. Hao, S.; Ma, X.; Fu, Z.X.; Wang, Q.L.; Li, H.A. Landing Cooperative Target Robust Detection via Low Rank and Sparse Matrix Decomposition. In Proceedings of the 3rd International Symposium on Computer, Consumer and Control (IS3C), Xi'an, China, 4–6 July 2016; pp. 172–175.
10. Zhou, W.N.; Xue, X.Y.; Chen, Y. Low-Rank and Sparse Decomposition Based Frame Difference Method for Small Infrared Target Detection in Coastal Surveillance. IEICE Trans. Inf. Syst. 2016, 99, 554–557.
11. Chen, C.L.P.; Li, H.; Wei, Y.T.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581.
12. Qin, Y.; Li, B. Effective Infrared Small Target Detection Utilizing a Novel Local Contrast Method. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1890–1894.
13. He, Y.J.; Li, M.; Wei, Z.H.; Cai, Y.C. Infrared Small Target Detection Based on Weighted Variation Coefficient Local Contrast Measure. In Proceedings of the 4th Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Beijing, China, 19–21 December 2021; pp. 117–127.
14. Xu, D.Q.; Wu, Y.Q. MRFF-YOLO: A Multi-Receptive Fields Fusion Network for Remote Sensing Target Detection. Remote Sens. 2020, 12, 3118.
15. Liu, H.; Ding, M.; Li, S.; Xu, Y.B.; Gong, S.L.; Kasule, A.N. Small-Target Detection Based on an Attention Mechanism for Apron-Monitoring Systems. Appl. Sci. 2023, 13, 5231.
16. Xu, H.; Zhong, S.; Zhang, T.X.; Zou, X. Multiscale Multilevel Residual Feature Fusion for Real-Time Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002116.
17. Chen, Q.; Wang, Y.M.; Yang, T.; Zhang, X.Y.; Cheng, J.; Sun, J. You Only Look One-Level Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13034–13043.
18. Xiong, J.; Wu, J.; Tang, M.; Xiong, P.W.; Huang, Y.S.; Guo, H. Combining YOLO and background subtraction for small dynamic target detection. Visual Comput. 2024.
19. Lin, J.; Zhang, K.; Yang, X.; Cheng, X.Z.; Li, C.H. Infrared dim and small target detection based on U-Transformer. J. Vis. Commun. Image Represent. 2022, 89, 103684.
20. Tong, X.Z.; Zuo, Z.; Su, S.J.; Wei, J.Y.; Sun, X.Y.; Wu, P.; Zhao, Z.Q. ST-Trans: Spatial-Temporal Transformer for Infrared Small Target Detection in Sequential Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5001819.
21. Zhang, F.; Lin, S.L.; Xiao, X.Y.; Wang, Y.; Zhao, Y.Q. Global attention network with multiscale feature fusion for infrared small target detection. Opt. Laser Technol. 2024, 168, 110012.
22. Chen, G.; Wang, W.H.; Li, X.J. Designing and learning a lightweight network for infrared small target detection via dilated pyramid and semantic distillation. Infrared Phys. Technol. 2023, 131, 104671.
23. Wang, K.W.; Du, S.Y.; Liu, C.X.; Cao, Z.G. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013.
24. Dai, Y.M.; Wu, Y.Q.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
25. Yuan, S.; Qin, H.L.; Yan, X.; Akhtar, N.; Mian, A. SCTransNet: Spatial-Channel Cross Transformer Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5002615.
26. Kou, R.K.; Wang, C.P.; Yu, Y.; Peng, Z.M.; Huang, F.Y.; Fu, Q. Infrared Small Target Tracking Algorithm via Segmentation Network and Multistrategy Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612912.
27. Li, B.Y.; Xiao, C.; Wang, L.G.; Wang, Y.Q.; Lin, Z.P.; Li, M.; An, W.; Guo, Y.L. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2023, 32, 1745–1758.
28. Chen, Y.H.; Li, L.Y.; Liu, X.; Su, X.F. A Multi-Task Framework for Infrared Small Target Detection and Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5003109.
29. Hui, B.; Song, Z.; Fan, H.; Zhong, P.; Hu, W.; Zhang, X.; Ling, J.; Su, H.; Jin, W.; Zhang, Y.; et al. A dataset for infrared image dim-small aircraft target detection and tracking under ground/air background. Sci. Data Bank 2019, 5, 291–302.
30. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
31. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
32. Wang, C.; Yeh, I.; Liao, H.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
33. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
34. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
35. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
36. Wu, T.H.; Li, B.Y.; Luo, Y.H.; Wang, Y.Q.; Xiao, C.; Liu, T.; Yang, J.G.; An, W.; Guo, Y.L. MTU-Net: Multilevel TransUNet for Space-Based Infrared Tiny Ship Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601015.
37. Xu, X.D.; Wang, J.R.; Zhu, M.; Sun, H.J.; Wu, Z.Y.; Wang, Y.; Cao, S.Y.; Liu, S.Z. UCDnet: Double U-Shaped Segmentation Network Cascade Centroid Map Prediction for Infrared Weak Small Target Detection. Remote Sens. 2023, 15, 3736.
Figure 1. Backgrounds in infrared remote sensing images are often highly complex. As depicted in the first row of images, there are many sources of interference in the background that resemble the shape of the target. As shown in the second row, the low grayscale values and contrast of the target make it difficult to perceive.
Figure 2. The complete workflow of our proposed centroid detection method for weak, small infrared targets. This workflow includes four parts: input data preprocessing, predicting centroid map, conversion of centroid map to centroid coordinates, and centroid coordinate correction.
Figure 3. (a) The reference image. The process in (b–d) demonstrates the background matching procedure between each image in the sequence and the reference image. The red dashed box in (a) represents the search area for the most complex image block, and the red solid box in (a) represents the most complex image block within that search area. The red dashed box in (b) marks the position of the red solid box from (a) and serves as the search area in (b); the green solid box in (b) represents the image block within this search area that is most similar to the block marked by the red solid box in (a). The red solid box in (c) represents the area where the backgrounds of (a) and (b) overlap.
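To make the matching procedure in Figure 3 concrete, the following is a minimal NumPy sketch of one plausible implementation: it assumes grayscale variance as the measure of block "complexity" and normalized cross-correlation as the similarity measure, neither of which is specified in the caption.

```python
import numpy as np

def select_complex_block(ref: np.ndarray, block: int = 32, stride: int = 8):
    """Pick the block with the largest grayscale variance in the reference frame
    (variance is used here as a stand-in for the 'complexity' measure)."""
    best, best_score = (0, 0), -1.0
    h, w = ref.shape
    for y in range(0, h - block + 1, stride):
        for x in range(0, w - block + 1, stride):
            score = ref[y:y + block, x:x + block].var()
            if score > best_score:
                best, best_score = (y, x), score
    return best

def match_block(ref: np.ndarray, img: np.ndarray, top_left, block: int = 32, search: int = 16):
    """Estimate the background shift of `img` relative to `ref` by maximizing the
    normalized cross-correlation of the reference block inside a search window."""
    y0, x0 = top_left
    tpl = ref[y0:y0 + block, x0:x0 + block].astype(np.float64)
    tpl = (tpl - tpl.mean()) / (tpl.std() + 1e-8)
    best, best_score = (0, 0), -np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + block > img.shape[0] or x + block > img.shape[1]:
                continue
            cand = img[y:y + block, x:x + block].astype(np.float64)
            cand = (cand - cand.mean()) / (cand.std() + 1e-8)
            score = float((tpl * cand).mean())
            if score > best_score:
                best, best_score = (dy, dx), score
    return best
```

The displacement returned by match_block gives the background shift between the two frames, which can then be used to crop the overlapping region highlighted in panel (c).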
Figure 4. High-energy regions of the centroid map produced by the simplified two-dimensional Gaussian formula for σ² values of 0.5, 1, 2, and 4.
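For reference, a plausible form of the simplified two-dimensional Gaussian used to generate the centroid map, with the target centroid at (x_c, y_c); the exact expression and normalization used in the paper may differ:

```latex
G(x, y) = \exp\!\left( -\frac{(x - x_c)^{2} + (y - y_c)^{2}}{2\sigma^{2}} \right)
```

Under this form, a larger σ² spreads the high-energy region over more pixels, which is the trend visible across the four panels of Figure 4.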
Figure 5. The centroid detection network for weak, small infrared targets proposed herein primarily consists of LFAM. LFAM incorporates spatial attention mechanisms, enabling more accurate detection of the centroids of weak, small infrared targets. The gray cuboid in the middle of the figure represents the result of concatenating all feature maps generated in the encoding and decoding stages along the channel dimension. In the context of the dataset used in this paper, both h and w in the above figure are 256.
Figure 6. The specific architecture of LFAM designed herein combines the structure of Inception V3 with spatial attention modules. The input feature map is first subjected to average pooling and max pooling operations along the channel dimension to generate two spatial attention feature maps. These two feature maps are then expanded to three feature maps and multiplied with feature maps obtained through three different convolutional methods. Finally, the resulting three feature maps from the previous step are merged along the channel dimension and input into a convolutional layer with a 1 × 1 kernel size to generate the final feature map. The feature maps of different colors in the figure are only used to indicate that they are different feature maps.
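The following PyTorch sketch illustrates one plausible reading of the Figure 6 description; the branch kernel sizes (1 × 1, 3 × 3, 5 × 5), the 7 × 7 convolution that expands the two pooled maps into three attention maps, and the class name LFAMSketch are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class LFAMSketch(nn.Module):
    """Sketch of a local feature aggregation module following the Figure 6 description."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Three parallel convolution branches with different receptive fields (Inception-style).
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2) for k in (1, 3, 5)
        )
        # Expand the stacked avg/max-pooled maps into one spatial attention map per branch.
        self.attn = nn.Conv2d(2, 3, kernel_size=7, padding=3)
        # Merge the three weighted branches back to out_ch channels with a 1x1 convolution.
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Average and max pooling along the channel dimension -> two (B, 1, H, W) maps.
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.max(dim=1, keepdim=True).values
        # Two pooled maps -> three spatial attention maps, one per convolution branch.
        attn = torch.sigmoid(self.attn(torch.cat([avg_map, max_map], dim=1)))
        # Multiply each branch output by its attention map.
        feats = [branch(x) * attn[:, i:i + 1] for i, branch in enumerate(self.branches)]
        # Concatenate along the channel dimension and fuse with the 1x1 convolution.
        return self.fuse(torch.cat(feats, dim=1))
```

With the five matched frames as input channels and h = w = 256 (Figure 5), a call such as LFAMSketch(5, 16)(torch.randn(1, 5, 256, 256)) would produce a (1, 16, 256, 256) feature map.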
Figure 7. Offset weight maps of the GCM and our proposed WCM: (a) the offset weight map in the x direction for the GCM; (b) the offset weight map in the y direction for the GCM; (c) the offset weight map in the x direction for our proposed WCM; and (d) the offset weight map in the y direction for our proposed WCM.
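For context, the grayscale centroid method (GCM) compared in Figure 7 and Table 5 is conventionally computed as the intensity-weighted mean of the pixel coordinates inside a small window around the coarse prediction; a minimal NumPy sketch is given below. The offset weight maps of the proposed WCM are not reproduced here, and the window handling is an assumption.

```python
import numpy as np

def grayscale_centroid(patch: np.ndarray, top_left=(0.0, 0.0)):
    """Grayscale centroid method (GCM): intensity-weighted mean of the pixel
    coordinates in a window cropped around the coarse centroid prediction."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    w = patch.astype(np.float64)
    total = w.sum()
    if total == 0:
        # Degenerate (all-zero) window: fall back to the window centre.
        return top_left[0] + (patch.shape[0] - 1) / 2, top_left[1] + (patch.shape[1] - 1) / 2
    cy = (w * ys).sum() / total + top_left[0]
    cx = (w * xs).sum() / total + top_left[1]
    return cy, cx
```

Here top_left is the (row, column) origin of the cropped window in the full image; a typical call would crop, say, a 7 × 7 window around the network's coarse prediction.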
Figure 8. Performance of the eight algorithms in four testing scenes. The red box in the figure indicates the location of the target, ‘×’ indicates the true location of the centroid, and the black arrow represents multiple overlapping centroid prediction results for the current pixel. It is worth noting that centroids predicted by different algorithms may be the same.
Figure 9. Line plots of the centroid localization accuracy and false alarm rate of the eight algorithms when the distance between the predicted and actual centroids is required to be less than 1, 2, and 3 pixels, respectively. The plots allow the algorithms to be compared under different distance thresholds and show that our proposed algorithm delivers excellent performance.
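As a reading aid for Figure 9, the detection rate at a given distance threshold can be computed roughly as follows; the greedy one-to-one matching is an assumption, since the exact matching protocol is not restated here.

```python
import numpy as np

def detection_rate(pred, gt, threshold: float = 1.0) -> float:
    """Fraction of ground-truth centroids with a predicted centroid within
    `threshold` pixels (Euclidean distance), using greedy one-to-one matching."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    if len(gt) == 0:
        return 0.0
    used = np.zeros(len(pred), dtype=bool)
    hits = 0
    for g in gt:
        if len(pred) == 0:
            break
        d = np.linalg.norm(pred - g, axis=1)
        d[used] = np.inf            # each prediction may match at most one ground truth
        j = int(np.argmin(d))
        if d[j] <= threshold:
            used[j] = True
            hits += 1
    return hits / len(gt)
```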
Table 1. Characteristics of the data in each scene of the dataset.

Scene | Description | Quantity | Train | Test
Scene 1 | Background changes slowly, targets appear as points, targets are small with little variation in size, the smallest target size is just one pixel, and targets move slowly. | 3000 | ✓ |
Scene 4 | (as above) | 399 | ✓ |
Scene 8 | (as above) | 1500 | ✓ |
Scene 11 | (as above) | 751 | | ✓
Scene 2 | The targets have very low contrast and move slowly, and the high complexity of the background causes some targets to be submerged in the background in certain images. | 399 | ✓ |
Scene 6 | (as above) | 401 | | ✓
Scene 7 | (as above) | 745 | ✓ |
Scene 9 | (as above) | 763 | | ✓
Scene 10 | (as above) | 1426 | ✓ |
Scene 3 | High background complexity, uneven lighting in some images, relatively large and continuously changing target sizes, and fast target movement. | 399 | ✓ |
Scene 5 | (as above) | 399 | ✓ |
Scene 12 | (as above) | 500 | ✓ |
Scene 13 | (as above) | 500 | | ✓
Scene 14 | (as above) | 499 | ✓ |
Table 2. The influence of different σ² values on the target centroid detection rate and centroid localization accuracy.

σ² | mDis | Dr
0.5 | 0.5205 | 79.66%
1 | 0.4224 | 86.83%
2 | 0.2577 | 87.37%
4 | 0.3008 | 85.66%
Table 3. Comparison of average metrics between our method and ten other methods across all test sets.

Method | mDis | Dr | Fr | FLOPs (G) | Params (M) | FPS
YOLOv5s | 0.5226 | 42.07% | 0.3573 | 1.2641 | 7.0128 | 42
YOLOv7 [31] | 0.5981 | 70.02% | 0.1867 | 8.2657 | 36.4799 | 45
YOLOv8n | 0 | 0 | 0 | 0.6511 | 3.0110 | 164
YOLOv9 [32] | 0 | 0 | 0 | 3.0972 | 9.5980 | 36
YOLOv10s [33] | 0 | 0 | 0 | 1.9817 | 8.0671 | 102
SSD [34] | 0.8270 | 68.20% | 0.3238 | 30.4530 | 23.7454 | 49
CenterNet [35] | 1.2932 | 63.73% | 0.3072 | 8.7420 | 32.6642 | 32
DNAnet [27] | 0.0082 | 57.35% | 0.1818 | 14.2479 | 4.6969 | 35
MTU-Net [36] | 0.0779 | 60.25% | 0.3337 | 6.2123 | 8.2202 | 78
UCDnet [37] | 0.0401 | 58.67% | 0.3313 | 14.7372 | 9.2762 | 36
Ours | 0.2577 | 87.37% | 0.1263 | 1.1871 | 0.1574 | 40
Table 6. Impact of different input images on the network's performance.

Input | mDis | Dr | Fr
Single-frame images | 0.7107 | 69.07% | 0.3093
Sequences of 5 unmatched images | 0.2185 | 77.49% | 0.2251
Sequences of 5 matched images | 0.2577 | 87.37% | 0.1263
Sequences of 10 matched images | 0.4142 | 89.11% | 0.1089
Sequences of 20 matched images | 0.3500 | 91.79% | 0.0821
Table 7. Comparison between our proposed feature extraction module and a conventional stack of convolutional layers.

Method | mDis | Dr | Fr | FLOPs (G) | Params (M)
CNN | 0.2737 | 84.95% | 0.1505 | 0.8427 | 0.1231
LFAM | 0.2577 | 87.37% | 0.1263 | 1.1871 | 0.1574
Table 4. Comparison of our centroid detection method with seven other methods in all test scenes. Each cell lists mDis / Dr / Fr.

Method | Scene 6 | Scene 9 | Scene 11 | Scene 13
YOLOv5s | 0.46 / 42% / 0.07 | 0.60 / 59% / 0.20 | 0.43 / 38% / 0.70 | 0.54 / 22% / 0.31
YOLOv7 | 0.54 / 41% / 0.06 | 0.60 / 63% / 0.04 | 0.60 / 96% / 0.51 | 0.63 / 65% / 0.02
SSD | 0.79 / 53% / 0.23 | 0.98 / 72% / 0.07 | 0.63 / 78% / 0.61 | 0.96 / 60% / 0.36
CenterNet | 1.08 / 67% / 0.04 | 1.28 / 71% / 0.14 | 1.40 / 78% / 0.57 | 1.29 / 28% / 0.38
DNAnet | 0.00 / 0% / 0.00 | 0.01 / 72% / 0.27 | 0.00 / 89% / 0.22 | 0.01 / 34% / 0.14
MTU-Net | 0.00 / 0% / 0.17 | 0.09 / 76% / 0.09 | 0.08 / 92% / 0.57 | 0.03 / 36% / 0.47
UCDnet | 0.03 / 30% / 0.04 | 0.05 / 67% / 0.20 | 0.03 / 88% / 0.30 | 0.02 / 25% / 0.80
Ours | 0.33 / 51% / 0.49 | 0.34 / 95% / 0.05 | 0.20 / 98% / 0.02 | 0.36 / 90% / 0.10
Table 5. Performance of our proposed method (WCM) and the grayscale centroid method (GCM) in centroid correction.

Method | mDis | Dr | Fr
YOLOv5s | 0.5226 | 42.07% | 0.3573
YOLOv5s + GCM | 0.5195 | 42.07% | 0.3573
YOLOv5s + WCM | 0.4995 | 42.07% | 0.3573
YOLOv7 | 0.5981 | 70.02% | 0.1867
YOLOv7 + GCM | 0.5881 | 70.02% | 0.1867
YOLOv7 + WCM | 0.5646 | 70.02% | 0.1867
SSD | 0.8270 | 68.20% | 0.3238
SSD + GCM | 0.8073 | 68.45% | 0.3213
SSD + WCM | 0.7726 | 68.45% | 0.3213
CenterNet | 1.2932 | 63.73% | 0.3072
CenterNet + GCM | 1.2525 | 63.81% | 0.3064
CenterNet + WCM | 1.2226 | 64.51% | 0.2994
DNAnet | 0.0082 | 57.35% | 0.1818
DNAnet + GCM | 0.0423 | 57.35% | 0.1818
DNAnet + WCM | 0.0720 | 57.35% | 0.1818
MTU-Net | 0.0779 | 60.25% | 0.3337
MTU-Net + GCM | 0.1094 | 60.25% | 0.3337
MTU-Net + WCM | 0.1297 | 60.25% | 0.3337
UCDnet | 0.0401 | 58.67% | 0.3313
UCDnet + GCM | 0.0734 | 58.67% | 0.3313
UCDnet + WCM | 0.1015 | 58.67% | 0.3313
Ours | 0.2577 | 87.37% | 0.1263
Ours + GCM | 0.2562 | 87.41% | 0.1259
Ours + WCM | 0.2443 | 87.41% | 0.1259
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
