Article

Multi-Scale Hierarchical Feature Fusion for Infrared Small-Target Detection

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3 State Key Laboratory of Applied Optics, Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 428; https://doi.org/10.3390/rs17030428
Submission received: 2 November 2024 / Revised: 31 December 2024 / Accepted: 6 January 2025 / Published: 27 January 2025

Abstract

Detecting small targets in infrared images is challenging due to their tiny size and complex backgrounds, making this task a research hotspot. Traditional methods rely on assumption-based modeling and manual design and struggle to handle the variability of real-world scenarios. Although convolutional neural networks (CNNs) improve robustness to diverse scenes through a data-driven paradigm, many CNN-based methods are insufficient in capturing the fine-grained details necessary for small targets and are less effective during multi-scale feature fusion. To overcome these challenges, we propose the novel Wide-scale Gated Fully Fusion Network (WGFFNet) for infrared small-target detection (IRSTD). WGFFNet uses a classic encoder–decoder structure: the designed stepped fusion block (SFB) embedded in the feature extraction stage captures finer local context across multiple scales during encoding, and along the decoding path, multi-level features are progressively integrated by a Fully Gated Interaction (FGI) Module to enhance feature representation. The inclusion of a boundary difference loss further optimizes the edge details of targets. We conducted comprehensive experiments on two public infrared small-target datasets: SIRST-V2 and IRSTD-1k. Quantitative and qualitative results demonstrate that, when various evaluation metrics are considered together, our WGFFNet outperforms representative methods, achieving improved detection performance and computational efficiency for small targets in infrared images.

1. Introduction

Infrared small-target detection (IRSTD) is a crucial subtask in infrared search and tracking systems owing to its role in complex scenarios such as maritime target surveillance [1], ground anomaly monitoring [2], and flight guidance [3]. Accurate detection of infrared small targets benefits these practical applications. However, unlike colored natural images with distinct and relatively large targets in traditional segmentation tasks, the characteristics of targets in infrared images pose challenges to the detection process. Typically, the targets are tiny, occupying less than 0.12% of the pixels in a 256 × 256 infrared image [4] due to the long imaging distance. In addition, interference in the radiation transmission process and the disturbance of complex background noise and clutter further obscure the shapes and textures of targets, resulting in their dim intensity and low contrast. Therefore, the IRSTD task remains a research hotspot in need of effective improvements.
Early methods employed manually designed features and algorithms that aimed to separate targets from backgrounds by analyzing the physical properties of small targets. Commonly used approaches include filter-based [5,6], local contrast-based [7,8,9], and low-rank representation-based methods [10,11,12]. They rely on straightforward assumptions and modeling of the target characteristics and background contexts. However, a rough manual design with multiple sensitive hyperparameters cannot depict various targets or cope with diverse infrared scenarios, leading to unsatisfactory detection performance.
Compared with traditional model-driven methods, deep neural networks trained in a data-driven paradigm have shown better robustness and detection results in IRSTD. They automatically learn features from large amounts of infrared data during training and then make predictions. Bounding box regression [13,14,15] is commonly used for target detection, but it is not suitable for small targets due to the difficulties of anchor presetting and object pre-positioning [16]. Consequently, most public infrared small-target datasets are labeled in mask format, and semantic segmentation [17,18,19] with an encoder–decoder architecture has become the mainstream deep learning approach for tackling IRSTD. Small targets tend to be lost after multiple downsampling steps during encoding, while shallow encoders cannot yield strong contextual information. To address this dilemma, most methods use multi-scale feature fusion to bridge spatial and semantic information gaps, for example, the asymmetric contextual modulation module (ACM) introduced by Dai et al. [19] or the strategy of repetitively fusing low- and high-level features proposed by Li et al. [20]. However, as shown in Figure 1, small targets can be effectively captured when the receptive field scales are appropriately sized, which enhances subsequent feature fusion strategies by providing a more refined feature representation. This motivates us to design a more granular encoding process and to explore more effective feature fusion approaches.
Thus, we propose a novel supervised segmentation network referred to as the Wide-scale Gated Fully Fusion Network (WGFFNet). Specifically, it incorporates a stepped fusion block (SFB) with multi-branch dense connections as the feature encoding building block, enabling the capture of finer local context at multiple scales. For decoding, we implement a progressive fusion strategy that gradually incorporates low-level features while refining all outgoing information using the Fully Gated Interaction (FGI) Module. Additionally, a boundary difference loss term is introduced during network training to mitigate the boundary ambiguity and data imbalance problems. To verify the effectiveness of our proposed method, we conducted experiments on two public infrared small-target datasets, SIRST-V2 [21] and IRSTD-1k [22]. The results show that our WGFFNet achieves superior performance compared with previous state-of-the-art (SOTA) methods.
The main contributions of this work are summarized as follows:
  • We proposed a segmentation network, WGFFNet, specially designed to enhance granular feature representation and fully leverage multi-level features. This approach ensures comprehensive context information and sufficient spatial information, significantly improving the detection performance for small targets.
  • We introduced the stepped fusion block (SFB), which captures delicate local and global representations across each feature scale. We then implemented a progressive feature decoding strategy with the Fully Gated Interaction Module (FGI) to hierarchically filter and refine target-specific information at each decoding stage.
  • We trained our network from scratch with the help of an additional boundary difference loss on two public datasets. Comprehensive experiments, including ablation studies and comparisons with state-of-the-art (SOTA) methods, were conducted, and the results demonstrate the effectiveness and superiority of our WGFFNet.
The rest of this paper is organized as follows. Section 2 briefly reviews works related to our method. Section 3 gives a detailed description of our WGFFNet and each of its components. Experimental results and analysis are presented in Section 4, a discussion is provided in Section 5, and Section 6 gives concluding remarks and the future outlook of our work.

2. Related Work

2.1. Infrared Small Target Detection

Although small target detection methods for natural images have advanced significantly, the challenges posed by infrared images, such as the lack of color features and low resolution, make it necessary to design specialized detection approaches for them. Among traditional methods, filter-based methods design specific image filters to extract targets, assuming that targets are outliers against continuous infrared backgrounds [5,6,23,24]; this is only applicable when the background is simple and uniform. Local contrast-based methods leverage target saliency within the local background [7,8,9,25,26,27] but are limited when targets are dim and blend into the background. Low-rank representation methods model the task as a mathematical optimization problem using low-rank sparse component decomposition [10,11,12,28,29,30] but are computationally expensive and sensitive to hyperparameter tuning. These methods rely on prior assumptions and struggle with complex infrared backgrounds and variable target appearances.
CNNs provide better and more robust detection performance in IRSTD through data-driven learning. Different network architectures have been proposed to address its difficulties. Networks like ACM [19], ALCNet [31], and AGPCNet [32] perform complementary fusion of low-level spatial and high-level semantic information to enhance feature representation, while DNA-Net [20] and UIU-Net [33] use specially designed feature extraction backbones to generate sophisticated multi-scale features and adopt deep supervision during training. Beyond that, target edge reconstruction is further considered by ISNet [22] and MSAFFNet [34] to improve segmentation accuracy. Apart from CNN structures, Transformers, with their superior performance in computer vision, have also been introduced to IRSTD. IAANet employs a transformer encoder to model the interior attention between targets and background [35]. The progressive background-aware transformer (PBT) proposed by Yang et al. [36] designs an asymmetric encoder-decoder structure utilizing the self-attention mechanism. These approaches capture global context well but often come with complex modeling and intensive computational costs.
By focusing on finer-grained local feature representation and effective multi-level feature fusion, our WGFFNet provides a more granular feature extraction process and a more efficient information integration strategy. Implicit attention to boundary reconstruction during network training further boosts detection performance while minimizing computational costs.

2.2. Context Modeling and Feature Fusion

Accurate small target segmentation requires both precise localization driven by low-level spatial information and reliable classification informed by high-level semantic information. Thus, obtaining rich high-resolution contextual semantics is essential. The DeepLab series [18,37] adopts atrous convolutions with different rates to expand receptive fields while maintaining resolution. Non-local attention mechanisms [38,39,40] allow networks to capture long-range dependencies across pixel positions to model the global context. Multi-scale feature fusion is another key strategy, with basic fusion operators performing addition [41,42] or concatenation [17,43] across layers. Advanced approaches incorporate channel and spatial attention mechanisms to adaptively recalibrate features [44,45]. In addition to cross-layer fusion, some studies also explore small-scale fusion within the same layer to perceive finer local features. For instance, the Inception block [46] concatenates branches with different convolutional stacks within a single layer to capture features at varying scales. The Res2Net [47] module hierarchically adds connections between adjacent branches to enhance layerwise multi-scale feature representation. Subsequent works further improved the effectiveness of branch fusion along these lines [48,49]. Beyond that, the development of Transformers [50] in computer vision has also brought new ideas. Pioneering works such as ViT [51] and DETR [52] perform strongly in numerous downstream vision tasks. They adopt MLP blocks for channel mixing and the self-attention mechanism for cross-location relation modeling, enabling powerful global context modeling.
Inspired by these works, our stepped fusion block (SFB) employs dense intra-layer connections through element-wise summation to capture precise spatial details, coupled with progressive inter-layer fusion to supplement semantic information.

2.3. Gating Mechanisms

Gate units typically involve a convolutional layer followed by a nonlinear function, whose trained weights are used to regulate information flow. In the natural language processing field, networks like LSTM [53] and GRU [54] use gates to control information propagation. In the vision field, GBD-Net [55], designed for object detection, uses gates to facilitate information exchange across different supported regions. GSCNN [56] applies gates to share information between the classic and shape encoding streams for better boundary information learning. Similarly, the GRB block [57] integrates RGB and depth signals through gated fusion, so that each modality benefits from the other's information.
In our work, we adopted the gated fully fusion mechanism proposed by Li et al. [58]. It simultaneously fuses multiple levels of features using gates, where each level can be enhanced with complementary spatial or semantic information from the rest. We apply this mechanism progressively during each decoding stage, aiming to retain the most useful information for small targets while suppressing redundant noise as much as possible.

3. Methodology

3.1. Network Architecture

As illustrated in Figure 2, our WGFFNet consists of three main components: a locally enhanced feature encoder, a progressive fully gated fused decoder, and a segmentation head. Given an infrared image, it first passes through the encoder to capture fine-scale details of small targets. Specifically, we modified ResNet-18 [59] by replacing the stem with three 3 × 3 convolutions and all BasicBlocks with stepped fusion blocks (SFBs), while also reducing the channel capacity. The stem downsamples the original image size by a factor of 4, and each subsequent layer downsamples by a factor of 2, except for layer-1. The encoder's details are provided in Table 1. This design enables each feature level to extract features at smaller scales, which is beneficial for the accurate localization of small targets. The stepped fusion block (SFB) will be described in detail in the next section.
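To make the encoder layout concrete, the following PyTorch sketch assembles a three-convolution stem (overall stride 4) with four SFB-based layers, where layer-1 keeps resolution and layers 2-4 halve it. This is a minimal sketch, not the authors' implementation: the channel widths, the placement of strided 3 × 3 convolutions before each SFB, and the single-channel input are our assumptions rather than the exact settings of Table 1, and `sfb` stands for a block such as the one sketched in Section 3.2.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    """3 x 3 convolution followed by BN and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    """Locally enhanced encoder sketch: three-conv stem (/4), then four SFB layers;
    layer-1 keeps resolution, layers 2-4 halve it. Widths are illustrative only."""
    def __init__(self, sfb, widths=(16, 32, 64, 128)):
        super().__init__()
        c1, c2, c3, c4 = widths
        self.stem = nn.Sequential(
            conv_bn_relu(1, c1, stride=2),            # /2
            conv_bn_relu(c1, c1),
            conv_bn_relu(c1, c1, stride=2),           # /4 overall
        )
        self.layer1 = sfb(c1)                                                 # keeps size
        self.layer2 = nn.Sequential(conv_bn_relu(c1, c2, stride=2), sfb(c2))  # /8
        self.layer3 = nn.Sequential(conv_bn_relu(c2, c3, stride=2), sfb(c3))  # /16
        self.layer4 = nn.Sequential(conv_bn_relu(c3, c4, stride=2), sfb(c4))  # /32

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)
        f2 = self.layer2(f1)
        f3 = self.layer3(f2)
        f4 = self.layer4(f3)
        return f1, f2, f3, f4                         # multi-level features F1..F4
```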
After obtaining four feature maps $F_1, F_2, F_3, F_4$ at different resolutions from corresponding levels, the network proceeds along a top-down decoding path to progressively recover resolution. Starting with the deepest feature map $F_4$, each feature map is upsampled to match the size of the previous level and then jointly processed through the Fully Gated Interaction (FGI) Module. In the FGI, each feature map serves as the master branch individually, using gates to fuse useful information from other branches. This basic fusion is described as follows:
$$\tilde{F}_3^{(1)},\ \tilde{F}_4^{(1)} = \mathrm{FGI}^{(1)}\!\left(F_3,\ F_4\!\uparrow_{\mathrm{size}(F_3)}\right)$$
where $\mathrm{FGI}^{(s)}$ and $\tilde{F}_i^{(s)}$ denote the $s$-th stage of fully gated fusion and its output, respectively, and $F_i\!\uparrow_{\mathrm{size}(F_j)}$ represents the upsampling of feature map $F_i$ to the size of $F_j$. We specifically use a 2 × 2 transposed convolution as the upsampling operator in this process. The resulting outputs are further upsampled and, together with the next upper-level feature, undergo the same fully gated fusion. This process is repeated until $F_1, F_2, F_3, F_4$ are all enhanced, which is expressed as
$$\tilde{F}_2^{(2)},\ \tilde{F}_3^{(2)},\ \tilde{F}_4^{(2)} = \mathrm{FGI}^{(2)}\!\left(F_2,\ \tilde{F}_3^{(1)},\ \tilde{F}_4^{(1)}\right)$$
$$I_1, I_2, I_3, I_4 = \mathrm{FGI}^{(3)}\!\left(F_1,\ \tilde{F}_2^{(2)},\ \tilde{F}_3^{(2)},\ \tilde{F}_4^{(2)}\right)$$
In this way, we obtain a set of comprehensively interacted feature maps $I_1, I_2, I_3, I_4$. This nested fully gated fusion strategy allows for progressive feature refinement with gate control at each stage, ensuring that the most critical representations are retained for small targets throughout the decoding process.
In the end, all enhanced features, which share the same spatial resolution, are concatenated and passed through the segmentation head to generate pixel-wise predictions for small-target segmentation. This is described as
$$F_{\mathrm{concat}} = \cup\left(I_1, I_2, I_3, I_4\right)$$
where $\cup$ denotes the concatenation operation. The integrated feature map then passes through a 3 × 3 convolution followed by batch normalization (BN) and the ReLU layer, which are denoted as $B(\cdot)$ and $\delta(\cdot)$, respectively, to carry out smooth fusion on the results. Lastly, bilinear interpolation is applied to upsample the output to the original image size, resulting in the final prediction map $P$:
$$P = \left(\mathrm{Conv}_{1\times 1}\!\left(\delta\!\left(B\!\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{concat}})\right)\right)\right)\right)\!\uparrow_{\mathrm{size}(F_{\mathrm{orig}})}$$
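The decoding path and segmentation head described above can be sketched as follows. This is a simplified reading, not the released code: each level is first projected to one shared width C with a 1 × 1 lateral convolution, a single shared 2 × 2 transposed convolution serves as every 2× upsampling step, and `fgi(C, n)` builds an FGI module over n levels, as sketched later in Section 3.3; the paper's exact channel handling may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Progressive fully gated decoding path plus segmentation head (sketch)."""
    def __init__(self, fgi, in_widths=(16, 32, 64, 128), C=32):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(w, C, 1) for w in in_widths])
        self.fgi = nn.ModuleList([fgi(C, n) for n in (2, 3, 4)])      # FGI stages 1-3
        self.up = nn.ConvTranspose2d(C, C, kernel_size=2, stride=2)   # 2x upsampling
        self.head = nn.Sequential(
            nn.Conv2d(4 * C, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C),
            nn.ReLU(inplace=True),
            nn.Conv2d(C, 1, 1),                                       # pixel-wise logits
        )

    def forward(self, feats, out_size):
        f1, f2, f3, f4 = [lat(f) for lat, f in zip(self.lateral, feats)]
        # stage 1: fuse F3 with upsampled F4
        t3, t4 = self.fgi[0]([f3, self.up(f4)])
        # stage 2: fuse F2 with the upsampled stage-1 outputs
        t2, t3, t4 = self.fgi[1]([f2, self.up(t3), self.up(t4)])
        # stage 3: fuse F1 with the upsampled stage-2 outputs -> I1..I4
        i1, i2, i3, i4 = self.fgi[2]([f1, self.up(t2), self.up(t3), self.up(t4)])
        x = torch.cat([i1, i2, i3, i4], dim=1)                        # concatenation
        logits = self.head(x)                                         # smooth fusion + 1x1 conv
        return F.interpolate(logits, size=out_size, mode="bilinear",  # back to input size
                             align_corners=False)
```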

3.2. Stepped Fusion Block (SFB)

The receptive field of each pixel determines the scale of the region it perceives [60]. A larger receptive field is beneficial for accessing global context, whereas a smaller receptive field excels at capturing fine-grained local details. Integrating multi-scale feature fusion with a wide range of receptive fields can produce a more discriminative feature representation, which is essential for correctly distinguishing pixels. However, small targets in infrared images are often overlooked under large-scale receptive fields, which calls for multi-scale feature fusion at a more granular level to maintain their delicate local information. This motivates the design of the stepped fusion block (SFB).
As depicted in Figure 3, the SFB employs a stepwise fusion approach. For simplicity, the $k \times k$ convolutions followed by batch normalization (BN) and the ReLU function are denoted as $C_k(\cdot)$.
After the $C_1(\cdot)$ operation, the input feature map $f_{\mathrm{in}}$ is evenly split into $N$ subsets, which flow through corresponding branches, denoted by $f_i$, where $i \in \{1, 2, \ldots, N\}$ and $N = 4$. On the $i$-th branch, the input is added to the output of the $(i-1)$-th branch before being convolved. This hierarchical fusion is carried out over stages $s \in \{1, 2, \ldots, N\}$ to integrate features with varying receptive field scales. The generalized fusion operation on the $i$-th branch at stage $s$ is expressed as follows:
$$y_i^{(s)} = \begin{cases} C_3\!\left(x_i^{(s)}\right), & i = s \\[4pt] C_3\!\left(x_i^{(s)} + y_{i-1}^{(s)}\right), & i > s \end{cases}$$
where $i \in \{s, \ldots, N\}$, and $x_i^{(s)}$ and $y_i^{(s)}$ represent the input and output at the current stage, respectively. Specifically, at stage 1, each branch $i$ receives input $f_i$, while in subsequent stages, the inputs are the outputs from the previous stage:
$$x_i^{(s)} = \begin{cases} f_i, & s = 1 \\[4pt] y_i^{(s-1)}, & s > 1 \end{cases}$$
$$C_k(\cdot) = \delta\!\left(B\!\left(\mathrm{Conv}_{k\times k}(\cdot)\right)\right)$$
After this dense fusion process, the final outputs from each branch, $y_1^{(1)}, y_2^{(2)}, y_3^{(3)}, y_4^{(4)}$, are concatenated and smoothed. Along with the skip connection to maintain gradient propagation, the output $f_{\mathrm{out}}$ of the SFB is obtained as follows:
$$f_{\mathrm{out}} = C_1\!\left(\cup\!\left(y_1^{(1)}, y_2^{(2)}, y_3^{(3)}, y_4^{(4)}\right)\right) + f_{\mathrm{in}}$$
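A possible PyTorch reading of the SFB is sketched below, assuming one independent 3 × 3 convolution per (stage, branch) pair and an even four-way channel split; how the paper actually sizes or shares these convolutions is not specified here, so treat those details as assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_relu(channels_in, channels_out, k):
    """C_k(.): k x k convolution followed by BN and ReLU."""
    return nn.Sequential(
        nn.Conv2d(channels_in, channels_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(channels_out),
        nn.ReLU(inplace=True),
    )

class SteppedFusionBlock(nn.Module):
    """Stepped fusion block sketch: 1x1 conv, even split into N = 4 branches,
    stepped 3x3 fusion between adjacent branches over N stages, then
    concatenation, 1x1 smoothing, and a residual connection."""
    def __init__(self, channels, n_branches=4):
        super().__init__()
        assert channels % n_branches == 0
        self.n = n_branches
        w = channels // n_branches
        self.conv_in = conv_bn_relu(channels, channels, 1)
        # one C_3 per (stage s, branch i) with i >= s
        self.c3 = nn.ModuleDict({
            f"{s}_{i}": conv_bn_relu(w, w, 3)
            for s in range(1, n_branches + 1)
            for i in range(s, n_branches + 1)
        })
        self.conv_out = conv_bn_relu(channels, channels, 1)

    def forward(self, x):
        f = self.conv_in(x)
        xs = list(torch.chunk(f, self.n, dim=1))          # subsets f_1 .. f_N
        finals = []
        for s in range(1, self.n + 1):                    # stage s
            ys = {}
            for i in range(s, self.n + 1):                # branches i >= s
                inp = xs[i - 1] if i == s else xs[i - 1] + ys[i - 1]
                ys[i] = self.c3[f"{s}_{i}"](inp)
            finals.append(ys[s])                          # branch s stops here: y_s^(s)
            for i in range(s, self.n + 1):                # next stage reuses these outputs
                xs[i - 1] = ys[i]
        return self.conv_out(torch.cat(finals, dim=1)) + x    # smooth + skip connection
```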

3.3. Fully Gated Integration Module (FGI)

Inspired by the work of Li et al. [58], we incorporate the Fully Gated Integration (FGI) Module at each stage of the decoding process to integrate multi-level features. Assume that $F_l \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $l \in \{1, 2, \ldots, L\}$, represents the feature map at level $l$, which serves as an input of the FGI. Each level-$l$ feature is associated with a gated map $G_l$. The FGI operator is defined as
$$\tilde{F}_l = (1 + G_l)\cdot F_l + (1 - G_l)\cdot \sum_{i=1,\, i\neq l}^{L} G_i \cdot F_i$$
where $\tilde{F}_l$ is the output feature at the $l$-th level, and $\cdot$ denotes element-wise multiplication. Each gated map $G_l$, obtained through the gate unit, is represented as
$$G_l = \varrho\!\left(\mathrm{Conv}_{1\times 1}(F_l)\right)$$
In particular, the nonlinear function $\varrho$ is changed to the Sigmoid Linear Unit (SiLU) to avoid excessive loss of information. The detailed structure of the FGI is shown in Figure 4.
The FGI was designed for the holistic combination of features across all levels rather than limiting fusion to adjacent ones. The gates regulate the corresponding feature by enhancing its useful information and gathering complementary information from other levels, thus helping eliminate unnecessary redundancy and making fused features more discriminative. Our experiments demonstrate the effectiveness of the FGI in improving the detection performance of small targets.
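The FGI computation above can be sketched as a small PyTorch module operating on a list of same-resolution, same-channel feature maps. The single-channel gate map per level (a 1 × 1 convolution followed by SiLU) is our reading of the gate unit and may differ from the authors' channel layout.

```python
import torch.nn as nn

class FGI(nn.Module):
    """Fully gated fusion over n_levels feature maps of identical shape (sketch)."""
    def __init__(self, channels, n_levels):
        super().__init__()
        self.gate_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(n_levels)]
        )
        self.act = nn.SiLU()    # the paper replaces the usual sigmoid with SiLU

    def forward(self, feats):
        gates = [self.act(conv(f)) for conv, f in zip(self.gate_convs, feats)]
        outs = []
        for l, (f_l, g_l) in enumerate(zip(feats, gates)):
            # complementary information gathered from all other levels
            others = sum(g_i * f_i for i, (f_i, g_i)
                         in enumerate(zip(feats, gates)) if i != l)
            outs.append((1 + g_l) * f_l + (1 - g_l) * others)
        return outs
```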

3.4. Loss Function

There is a severe imbalance problem between positives (targets) and negatives (backgrounds) when segmenting small targets in infrared images. Region-based loss functions such as Soft-IoU loss [61] directly optimize the segmentation metric, effectively alleviating the imbalance and stabilizing the training process. Thus, we adopt the Soft-IoU loss, which is defined as
$$L_{\mathrm{soft\text{-}IoU}}(p, g) = 1 - \frac{\sum_{i,j} p_{i,j}\cdot g_{i,j}}{\sum_{i,j}\left(p_{i,j} + g_{i,j} - p_{i,j}\cdot g_{i,j}\right)}$$
where $p \in [0, 1]^{H \times W}$ denotes the predicted score map, with entries $p_{i,j}$, after applying the sigmoid function, and $g \in \{0, 1\}^{H \times W}$ is the ground-truth label map with entries $g_{i,j}$.
To draw the network's attention to small-target boundaries, we further add the Boundary DoU loss [62], which emphasizes boundary reconstruction and provides complementary information to the region-based Soft-IoU loss. The Boundary DoU loss is determined by calculating the ratio of the difference between the prediction and ground truth to the union of their difference and partial intersection:
$$L_{\mathrm{DoU}}(p, g) = \frac{\sum_{i,j}\left(p_{i,j} + g_{i,j} - 2\, p_{i,j}\cdot g_{i,j}\right)}{\sum_{i,j}\left(p_{i,j} + g_{i,j} - (1 + \alpha)\, p_{i,j}\cdot g_{i,j}\right)}$$
In this equation, the hyperparameter $\alpha$ controls the partial union area, adaptively adjusting to the size of the targets. Large targets have a clear distinction between their interior and boundary, where the boundary accounts for a small proportion. Therefore, a larger $\alpha$ should be used to focus more on the boundary. Conversely, the small target boundary is often mixed within the interior, so a smaller $\alpha$ is better to focus on both parts simultaneously. Based on this analysis, $\alpha$ is defined as
$$\alpha = 1 - \frac{2C}{S}, \qquad \alpha \in [0, 1)$$
where C and S represent the boundary area and the overall area of the target, respectively.
Figure 5 clearly illustrates the difference between the Soft-IoU loss and Boundary DoU loss. The Soft-IoU loss aims to align the entire union area, whereas the Boundary DoU loss focuses specifically on the partial union area, allowing for more refined segmentation when the prediction already matches most of the ground truth.
The final training objective $L_{\mathrm{total}}$ is the sum of $L_{\mathrm{soft\text{-}IoU}}$ and $L_{\mathrm{DoU}}$. It handles the target segmentation task stably while taking care of boundaries. Experiments have shown that a 1:1 proportion between the two losses achieves the best results. Therefore, the overall loss function is expressed as follows:
$$L_{\mathrm{total}} = L_{\mathrm{soft\text{-}IoU}} + L_{\mathrm{DoU}}$$
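The two loss terms and their equal-weight combination can be written compactly in PyTorch as below. The predictions are assumed to be sigmoid probabilities of shape (B, 1, H, W); the boundary area C is approximated with a 3 × 3 neighborhood count, which is our own approximation rather than the paper's exact boundary extraction.

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss: 1 minus the soft intersection-over-union, averaged over the batch."""
    target = target.float()
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target).sum(dim=(1, 2, 3)) - inter
    return (1 - inter / (union + eps)).mean()

def boundary_dou_loss(pred, target, eps=1e-6):
    """Boundary DoU loss with the size-adaptive alpha = 1 - 2C/S."""
    target = target.float()
    # approximate boundary pixels: target pixels whose 3x3 neighborhood is not full
    kernel = torch.ones(1, 1, 3, 3, device=target.device)
    neigh = F.conv2d(target, kernel, padding=1)
    boundary = ((neigh < 9) & (target > 0)).float()
    C = boundary.sum(dim=(1, 2, 3))
    S = target.sum(dim=(1, 2, 3)) + eps
    alpha = torch.clamp(1 - 2 * C / S, min=0.0, max=1.0 - 1e-3)   # alpha in [0, 1)

    inter = (pred * target).sum(dim=(1, 2, 3))
    total = (pred + target).sum(dim=(1, 2, 3))
    return ((total - 2 * inter) / (total - (1 + alpha) * inter + eps)).mean()

def total_loss(pred, target):
    """Overall training objective: equal-weight sum of the two losses."""
    return soft_iou_loss(pred, target) + boundary_dou_loss(pred, target)
```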

4. Experiments

In this section, we present experiments conducted to validate the effectiveness of our proposed WGFFNet. The specifics of the datasets, evaluation metrics, and training settings will be introduced. The comparison results with the state-of-the-art (SOTA) methods and network ablation study will also be presented to demonstrate the superiority of our method and the contributions of each component.

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

Our WGFFNet was evaluated on two publicly available datasets, SIRST-V2 [21] and IRSTD-1k [22], both of which have infrared images derived from the real world. SIRST-V2 is the updated version of the widely used SIRST dataset [19], with the addition of 597 infrared images containing more complex background interference. To stabilize the detection process, we removed 510 images containing only infrared background and divided the remaining 514 images into the training set and test set with a ratio of 8:2.
The IRSTD-1k dataset comprises 1000 infrared images featuring a broader range of infrared scenes with heavy disturbance, as well as further varied target types, presenting more challenges. To ensure sufficient training data, we reset the training-to-test ratio to 8:2 as well. During training, each image was first resized to 400 × 400, then randomly cropped to 384 × 384, and finally processed with Gaussian blur and normalization before feeding it into the network.
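A torchvision sketch of this training-time preprocessing is shown below; the Gaussian-blur kernel size and the normalization statistics are placeholders chosen for illustration, not values reported in the paper.

```python
import torchvision.transforms as T

# Training-time preprocessing: resize to 400x400, random crop to 384x384,
# Gaussian blur, then tensor conversion and normalization (single-channel input).
train_transform = T.Compose([
    T.Resize((400, 400)),
    T.RandomCrop((384, 384)),
    T.GaussianBlur(kernel_size=3),
    T.ToTensor(),
    T.Normalize(mean=[0.35], std=[0.20]),   # placeholder statistics
])
```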

4.1.2. Evaluation Metrics

Several commonly used segmentation metrics were chosen for evaluation and comparison, including pixel-level metrics such as Intersection over Union (IoU) and normalized IoU (nIoU) [19], target-level metrics such as the Probability of Detection (Pd) and False-Alarm Rate (Fa), and the Receiver Operating Characteristic (ROC) curve to assess the classification ability of the models.
At the pixel-level evaluation, TP, FP, TN, FN, and N represent the number of true positives, false positives, true negatives, false negatives, and total samples in the sample set, respectively. Accordingly, the IoU is defined as follows:
$$\mathrm{IoU} = \frac{\sum_i^{N} \mathrm{TP}[i]}{\sum_i^{N}\left(\mathrm{TP}[i] + \mathrm{FP}[i] + \mathrm{FN}[i]\right)}$$
It also denotes the ratio of the intersection and the union areas between the predictions and labels. nIoU, which represents the average IoU across all samples, is defined as follows:
$$\mathrm{nIoU} = \frac{1}{N}\sum_i^{N} \frac{\mathrm{TP}[i]}{\mathrm{TP}[i] + \mathrm{FP}[i] + \mathrm{FN}[i]}$$
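The distinction between the dataset-level IoU and the sample-averaged nIoU can be sketched with NumPy as follows, assuming binarized prediction and label masks.

```python
import numpy as np

def iou_and_niou(preds, labels, eps=1e-6):
    """Dataset-level IoU and sample-averaged nIoU over binary mask pairs."""
    tp_sum = fp_sum = fn_sum = 0
    per_image = []
    for p, g in zip(preds, labels):
        p, g = p.astype(bool), g.astype(bool)
        tp = np.logical_and(p, g).sum()
        fp = np.logical_and(p, ~g).sum()
        fn = np.logical_and(~p, g).sum()
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
        per_image.append(tp / (tp + fp + fn + eps))      # IoU of one sample
    iou = tp_sum / (tp_sum + fp_sum + fn_sum + eps)      # accumulated over all samples
    niou = float(np.mean(per_image))                     # averaged across samples
    return iou, niou
```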
The ROC curve illustrates the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR) as the classification threshold $t \in [0, 1]$ varies:
$$\mathrm{TPR}_t = \frac{\sum_i^{N} \mathrm{TP}_i^{t}}{\sum_i^{N}\left(\mathrm{TP}_i^{t} + \mathrm{FN}_i^{t}\right)}, \qquad \mathrm{FPR}_t = \frac{\sum_i^{N} \mathrm{FP}_i^{t}}{\sum_i^{N}\left(\mathrm{FP}_i^{t} + \mathrm{TN}_i^{t}\right)}$$
At the target-level evaluation, metric Pd is defined as the ratio of correctly predicted targets to the total number of targets:
$$P_d = \frac{N_{\mathrm{pred}}}{N_{\mathrm{all}}}$$
Metric Fa is defined as the ratio of falsely predicted pixels to the total number of pixels in the image:
$$F_a = \frac{P_{\mathrm{false}}}{P_{\mathrm{all}}}$$
Following [20], we consider targets with centroid deviation of less than three pixels to be correctly predicted ones. Otherwise, the corresponding pixels of those targets are treated as falsely predicted.
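Pd and Fa with the three-pixel centroid rule can be sketched using SciPy's connected-component labeling; the greedy one-to-one matching between predicted and ground-truth components below is our own simplification of the evaluation protocol.

```python
import numpy as np
from scipy import ndimage

def pd_fa(preds, labels, dist_thresh=3.0):
    """Probability of detection (Pd) and false-alarm rate (Fa) over binary masks."""
    n_detected = n_targets = false_pixels = total_pixels = 0
    for p, g in zip(preds, labels):
        total_pixels += g.size
        g_lab, g_num = ndimage.label(g)
        p_lab, p_num = ndimage.label(p)
        g_cent = ndimage.center_of_mass(g, g_lab, list(range(1, g_num + 1))) if g_num else []
        p_cent = ndimage.center_of_mass(p, p_lab, list(range(1, p_num + 1))) if p_num else []
        matched = set()
        n_targets += g_num
        for gc in g_cent:
            for k, pc in enumerate(p_cent):
                if k in matched:
                    continue
                if np.hypot(gc[0] - pc[0], gc[1] - pc[1]) < dist_thresh:
                    n_detected += 1          # centroid deviation below three pixels
                    matched.add(k)
                    break
        for k in range(p_num):               # pixels of unmatched predicted components
            if k not in matched:
                false_pixels += int((p_lab == k + 1).sum())
    return n_detected / max(n_targets, 1), false_pixels / max(total_pixels, 1)
```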

4.2. Implementation Details

All deep learning methods were implemented using PyTorch on a server with an Intel Xeon Silver CPU and an NVIDIA RTX A4000 GPU. The network was trained from scratch using the Adam optimizer with an initial learning rate of 0.0012. The learning rate was reduced after each epoch following a polynomial decay strategy with a power of 2. The batch size was set to 8, and the total number of training epochs was 600. Traditional methods used for comparison were implemented in the MATLAB R2021a environment on an Intel CPU.
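The optimizer and per-epoch polynomial decay can be reproduced with the following sketch, where `model`, `train_loader`, and `total_loss` (the combined loss from Section 3.4) are placeholders for the corresponding components.

```python
import torch

epochs, base_lr = 600, 0.0012
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# polynomial decay with power 2, applied once per epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / epochs) ** 2
)

for epoch in range(epochs):
    for images, masks in train_loader:                  # batch size 8
        optimizer.zero_grad()
        probs = torch.sigmoid(model(images))            # pixel-wise probabilities
        loss = total_loss(probs, masks)
        loss.backward()
        optimizer.step()
    scheduler.step()                                    # reduce the learning rate
```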

4.3. Comparison with State-of-the-Art Methods

We performed a comprehensive comparison of the proposed WGFFNet with nine related classic methods, including both traditional and data-driven approaches, through qualitative and quantitative evaluations. The selected traditional approaches included the TopHat filter [6] and the tri-layer local contrast measure (TLLCM) [8] as a local contrast method. We also considered low-rank sparse decomposition techniques, particularly non-convex rank approximation minimization (NRAM) [28] and the partial sum of the tensor nuclear norm (PSTNN) [29] tensor optimization method. We applied an adaptive threshold, set to $T_{\mathrm{adaptive}} = 0.5 \times$ the maximum value of the salient map, to binarize the salient maps derived from these methods. The remaining data-driven approaches consisted of ALCNet [31], DNANet [20], ISNet [22], UIUNet [33], and the transformer-based PBT [36]. Each of these was retrained on the same datasets as our WGFFNet.
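The adaptive binarization applied to the traditional baselines amounts to the following one-liner, assuming a NumPy salient map.

```python
import numpy as np

def binarize_salient_map(salient):
    """Adaptive threshold: T = 0.5 x the maximum value of the salient map."""
    return (salient > 0.5 * salient.max()).astype(np.uint8)
```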

4.3.1. Quantitative Results

Table 2 presents the comparison results of all methods on the SIRST-V2 and IRSTD-1k datasets, evaluated using multiple metrics. The best result for each metric is highlighted in bold. The following observations can be made. Firstly, advanced neural network methods were superior to conventional methods across most metrics on both datasets, exceeding the best traditional results by over 30% in mIoU and 20% in nIoU. The large improvement of CNN-based methods reveals their better capability to locate all small targets accurately and restore their true shapes as much as possible.
It is worth noting that traditional methods based on local contrast measures or low-rank representation achieve performance comparable to CNN-based methods in terms of the Fa metric, suggesting that they are less prone to falsely identifying target pixels. However, their lower values on the other metrics show that they struggle to distinguish confusing targets from complex backgrounds and often fail to capture all target pixels. This may be caused by the prior assumptions about target characteristics, such as local contrast distinctness or sparseness against backgrounds, as well as the fixed hyperparameter settings of the models, which constrain their applicability in real-world scenarios.
Secondly, our WGFFNet achieves the highest performance on most metrics among all networks, surpassing the second-best method by up to 1.45% in mIoU and 3.03% in Pd on the two datasets. These results confirm the superiority of our network design. The robust feature representation, achieved by finely mining local features and comprehensively aggregating multi-level features, balances the dilemma between retaining spatial details and acquiring global semantics. Additionally, the network's focus on boundary restoration during the optimization process promotes its ability to depict complete target shapes, further improving the overall segmentation performance.
Apart from this, it may be noticed that UIUNet and PBT demonstrate a slight advantage over our method on a few metrics. Given this, we compare the computational complexity and CPU inference time of all CNN-based methods in Table 3. It can be seen that UIUNet and PBT have large model sizes and computational costs. In contrast, our network is relatively lightweight and efficient, achieving a more favorable balance between accuracy and speed.
Next, the ROC curves of all comparison methods on the SIRST-V2 and IRSTD-1k datasets are presented in Figure 6. Our network is specifically marked with dashed lines, and area under the curve (AUC) values are indicated in the legend. It can be observed that our WGFFNet has outstanding classification ability compared with the other models. This confirms the robustness of our network in distinguishing small targets from complex backgrounds, as well as its ability to maintain consistent performance across varying conditions.
Furthermore, it can also be seen that all deep networks maintain a high level of correct predictions across varying false-alarm rates, whereas traditional methods, especially the local contrast ones, performed poorly in this regard. Such methods are constrained by predefined rules, which limit their ability to correctly identify targets even with few missed detections. Complex infrared backgrounds often deviate from these assumptions, making these methods particularly sensitive in real applications.

4.3.2. Qualitative Results

As shown in Figure 7 and Figure 8, we selected three representative and challenging infrared scenes from the SIRST-V2 and IRSTD-1k datasets to visually compare the segmentation results of all methods. Each target is enlarged and displayed in the top-right corner for a clearer view. It can be observed that traditional methods often struggle with false detections or missed targets, especially when targets are extremely small or the background is cluttered and impure. This is clearly shown in Figure 8 for the IRSTD-1k dataset, which contains many challenging detection scenes. Data-driven methods generally perform better under these challenging conditions. However, some of them still encounter difficulties when facing noise or background elements that are similar to the targets, such as in the second image of Figure 7 and the first two images of Figure 8. Our WGFFNet benefits from the localized perception provided by a finer receptive field and the complementary semantic information from hierarchical multi-level feature fusion, demonstrating a clear advantage in these scenarios. Its sufficient feature representation helps distinguish real targets from multiple interfering elements, resulting in the most accurate detection performance compared with the other methods.
Regarding the segmentation quality of the predicted targets, traditional methods tend to roughly locate their positions without depicting clear contours, while data-driven methods restore the true shapes of targets as much as possible. When further comparing all data-driven methods on the extent of shape restoration, our WGFFNet stands out by outlining target boundaries close to the ground truth. This is particularly evident in the last images of Figure 7 and Figure 8, where targets are relatively larger and have clearer edges. The addition of the Boundary DoU loss allows the network to implicitly focus on boundaries during training, so that the edges of small targets are well preserved in real detections.

4.4. Ablation Study

In this section, we conducted ablation experiments on both datasets to validate the effectiveness and contribution of each component within our model. Specifically, we analyzed the following aspects: (1) the impact of incorporating different backbone networks within the encoding path on the final performance; (2) the role of hierarchical multi-level feature fusion and the gated fusion mechanism within the decoding path; and (3) the effectiveness of combining the Boundary DoU loss for target boundary optimization. For each validation term in the ablation study, we retrained the network following the same training strategy and hyperparameters as employed in the original WGFFNet.

4.4.1. Effectiveness of Stepped Fusion Block

Given that our network's encoding path is built upon the ResNet-18 backbone, we compared the detection performance when different residual units are utilized within the backbone. Specifically, we compared the classic BasicBlock, the Bottle2neck of Res2Net [47], the stepped fusion block (SFB) without intra-fusion (dense summation between branches), and the original SFB. The results are presented in Table 4, and the best value of each metric is highlighted in bold.
Firstly, it can be observed that the SFB outperforms the other blocks, indicating its superior feature representation capability within the same feature encoding structure when dealing with small targets. The BasicBlock, with only a single branch, and the Bottle2neck, which lacks sufficient receptive field refinement, demonstrate weaker performance. Secondly, the lower metrics of the SFB without intra-fusion suggest that the dense connections between inner branches are essential. They expand the range of receptive field scales, which is beneficial for identifying dim and small targets.

4.4.2. Different Decoding Strategy

Next, we evaluated the benefits of our decoding strategy by testing three decoder variants to assess the roles of fully gated fusion and the progressive top-down inter-layer fusion approach. For the first variant, we replaced the gated fusion at each stage with a simple summation of features. The resulting multi-level feature maps were then upsampled to the input size and concatenated, as in the semantic segmentation branch used in Panoptic FPN [63]. For the second variant, we changed the summation of features to concatenation, similar to the U-Net architecture [17]. For the last variant, we omitted the full fusion of all current layers at each stage and directly applied gated fusion only between adjacent layers, which means that Equation (9) was modified as follows:
$$\tilde{F}_l = (1 + G_l)\cdot F_l + (1 - G_l)\cdot G_{l+1}\cdot F_{l+1}$$
This modification means that each layer only accepts gate control from the fusion result of the previous stage. We refer to these variants as baseline + summation, baseline + concatenation, and baseline + gated fusion, respectively.
Comparison results are shown in Table 5. They demonstrate that the optimal performance is achieved when both fully gated fusion and progressive summation are utilized. The gating mechanism effectively filters and preserves essential information, while the full integration across multiple stages ensures that the varying knowledge distilled from each stage is comprehensively utilized. Both elements are critical for achieving the superior detection performance of our network.

4.4.3. Combined Use of Two Loss Functions

We further explore the advantages of combining the Soft-IoU loss and Boundary DoU loss. Specifically, we compare the effect of each loss function used independently with that of their combination by training WGFFNet on the SIRST-V2 dataset. Figure 9 illustrates the loss progression during network training, demonstrating that the combination of the two losses stabilizes the training process and exploits the detection potential of WGFFNet.
Table 6 presents the optimal detection metrics derived from training with each loss configuration (training processes shown in Figure 9). It can be seen that the addition of the Boundary DoU loss significantly improves the detection metrics, as attention to target boundaries helps enhance segmentation to some extent.

5. Discussion

Our experiments revealed that traditional methods are inherently limited by prior assumptions and fixed model settings, which restrict their adaptation to complex and varying infrared scenes and make it difficult for them to maintain stable detection performance when confronting noisy backgrounds or indistinct small targets in the real world.
In contrast, data-driven methods mitigate these limitations by learning the intrinsic features of small targets from a large number of diverse backgrounds, enabling higher detection accuracy and robustness. Detection in the form of semantic segmentation also facilitates shape prediction of targets, benefiting subsequent recognition applications. In addition, the design of the network architecture affects the abilities of feature extraction and pixel classification. Our WGFFNet employs a novel stepped fusion block (SFB) to mine local details within layers, as well as a hierarchical inter-layer feature fusion scheme with a fully gated mechanism to integrate spatial and semantic knowledge, distilling the optimal feature representation for identifying small-target pixels. The boundary-oriented network optimization further refines the shape depiction of targets. Benefiting from the well-designed feature extraction and target restoration strategy, our network achieves superior detection performance while striking a balance between accuracy and computational efficiency.
While our network demonstrates improvements, it still falls short in global context modeling and in simpler yet effective feature fusion forms, so some metrics are not the best. In observing the ROC curves, it is notable that our network does not stand out from UIUNet or PBT in terms of robustness. This suggests that there is still room for improvement in the network's ability to handle complex situations. We may consider further exploring the integration of Transformer modules to achieve robust context modeling capabilities, while also aiming to reduce the associated computational costs.
Broadly speaking, the data-driven methods used in IRSTD still have limitations and challenges. The generalization and detection capabilities of an IRSTD network heavily depend on the volume and quality of the available infrared datasets. Existing public datasets are limited in size and diversity, and their mask labeling quality is relatively poor. This inspires studies to focus on expanding high-quality infrared datasets and exploring alternative labeling formats, such as bounding boxes. Moreover, the imbalance between positive and negative samples within small-target images tends to affect network training and final detection performance, making it critical to consider.
In simpler scenarios with purer backgrounds and distinct targets, traditional methods may still hold an advantage due to their lower computational demands. In order to take advantage of the network’s strengths, lightweight network structures and efficient training strategies can be further explored. We will continue to focus on these issues in our future work.

6. Conclusions

In this paper, we proposed the novel Wide-scale Gated Fully Fusion Network (WGFFNet), specially designed for small-target segmentation in infrared images. Our network addresses the limitations of existing methods through several key innovations: (1) The stepped fusion block (SFB) embedded in the feature encoding stage enhances the network's capability to capture finer local context across a wider range of receptive field scales, avoiding the loss of target details caused by their small size. (2) The progressive feature fusion using the Fully Gated Integration (FGI) Module integrates rich spatial and semantic information and selectively filters and refines target-specific information using gates to obtain the optimal feature representation for subsequent classification. Additionally, the introduction of the Boundary DoU loss during network training further optimizes the edges of targets, improving the segmentation results.
Comprehensive experiments were conducted on two public infrared small-target datasets, SIRST-V2 and IRSTD-1k, to verify the effectiveness and superiority of our WGFFNet. Compared with representative traditional and CNN-based methods, our experimental results, supported by both quantitative metrics and visual comparisons, demonstrate the superior detection performance and robustness of WGFFNet on small targets in complex and varying infrared scenes. Ablation studies further confirmed the critical role of each component design within our network.
In the future, we will continue to focus on exploring more sophisticated feature fusion strategies and enhanced context modeling techniques, as well as on addressing the evolving demands for both accuracy and speed in infrared small-target detection.

Author Contributions

Conceptualization, Y.W., X.W. and Z.L.; methodology, Y.W. and X.W.; validation, Y.W., X.C. and Y.Z.; formal analysis, Y.W.; writing–original draft, Y.W.; writing–review and editing, S.Q., X.W., W.Y., Z.L. and Y.W.; project administration, S.Q., X.W., C.Z., F.W., Z.S. and H.C.; funding acquisition, X.W., S.Q. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the CAS Project for Young Scientists in Basic Research (Grant No. YSBR-067); FFG and CAS Digital Technologies for Green—Bilateral Call (project: MARSHES); National Key Research and Development Program of China (Grant No. 2018YFB0504900, 2018YFB050490304).

Data Availability Statement

Two publicly available datasets were used to support this study. The SIRST-V2 dataset is available online at https://github.com/YimianDai/open-sirst-v2 (accessed on 6 January 2025); the IRSTD-1k dataset is available online at https://github.com/RuiZhang97/ISNet (accessed on 6 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Teutsch, M.; Krüger, W. Classification of small boats in infrared images for maritime surveillance. In Proceedings of the 2010 International Waterside Security Conference, Carrara, Italy, 3–5 November 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1–7. [Google Scholar]
  2. Ding, M.; Ding, Y.Y.; Wu, X.Z.; Wang, X.H.; Xu, Y.B. Action recognition of individuals on an airport apron based on tracking bounding boxes of the thermal infrared target. Infrared Phys. Technol. 2021, 117, 103859. [Google Scholar] [CrossRef]
  3. Zhijian, H.; Bingwei, H.; Shujin, S. An infrared sequence image generating method for target detection and tracking. Front. Comput. Neurosci. 2022, 16, 930827. [Google Scholar] [CrossRef]
  4. Zhang, W.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking. In Proceedings of the International Conference on Neural Networks and Signal Processing, 2003, Nanjing, China, 14–17 December 2003; IEEE: Piscataway, NJ, USA, 2003; Volume 1, pp. 643–647. [Google Scholar]
  5. Zhao, M.; Li, L.; Li, W.; Tao, R.; Li, L.; Zhang, W. Infrared small-target detection based on multiple morphological profiles. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6077–6091. [Google Scholar] [CrossRef]
  6. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156. [Google Scholar] [CrossRef]
  7. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
  8. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A local contrast method for infrared small-target detection utilizing a tri-layer window. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1822–1826. [Google Scholar] [CrossRef]
  9. Wei, H.; Ma, P.; Pang, D.; Li, W.; Qian, J.; Guo, X. Weighted Local Ratio-Difference Contrast Method for Detecting an Infrared Small Target against Ground–Sky Background. Remote Sens. 2022, 14, 5636. [Google Scholar] [CrossRef]
  10. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef]
  11. Dai, Y.; Wu, Y. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767. [Google Scholar] [CrossRef]
  12. Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752. [Google Scholar] [CrossRef]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  14. Yu, Z.; Pan, N.; Zhou, J. SFFNet: Shallow Feature Fusion Network Based on Detection Framework for Infrared Small Target Detection. Remote Sens. 2024, 16, 4160. [Google Scholar] [CrossRef]
  15. Yang, C.; Huang, Z.; Wang, N. Querydet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 13668–13677. [Google Scholar]
  16. Yuan, X.; Cheng, G.; Yan, K.; Zeng, Q.; Han, J. Small object detection via coarse-to-fine proposal generation and imitation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6317–6327. [Google Scholar]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  18. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  19. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959. [Google Scholar]
  20. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef]
  21. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-stage cascade refinement networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000917. [Google Scholar] [CrossRef]
  22. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 877–886. [Google Scholar]
  23. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets, Denver, CO, USA, 20–22 July 1999; SPIE: Bellingham, WA, USA, 1999; Volume 3809, pp. 74–83. [Google Scholar]
  24. Yang, L.; Yang, J.; Yang, K. Adaptive detection for infrared small target under sea-sky complex background. Electron. Lett. 2004, 40, 1. [Google Scholar] [CrossRef]
  25. Deng, H.; Sun, X.; Liu, M.; Ye, C.; Zhou, X. Small infrared target detection based on weighted local difference measure. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4204–4214. [Google Scholar] [CrossRef]
  26. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674. [Google Scholar] [CrossRef]
  27. Moradi, S.; Moallem, P.; Sabahi, M.F. Fast and robust small infrared target detection using absolute directional mean difference algorithm. Signal Process. 2020, 177, 107727. [Google Scholar] [CrossRef]
  28. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared small target detection via non-convex rank approximation minimization joint l 2, 1 norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  29. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
  30. Cao, Z.; Kong, X.; Zhu, Q.; Cao, S.; Peng, Z. Infrared dim target detection via mode-k1k2 extension tensor tubal rank under complex ocean environment. ISPRS J. Photogramm. Remote Sens. 2021, 181, 167–190. [Google Scholar] [CrossRef]
  31. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  32. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
  33. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar] [CrossRef]
  34. Tong, X.; Su, S.; Wu, P.; Guo, R.; Wei, J.; Zuo, Z.; Sun, B. MSAFFNet: A multi-scale label-supervised attention feature fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002616. [Google Scholar] [CrossRef]
  35. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior attention-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013. [Google Scholar] [CrossRef]
  36. Yang, H.; Mu, T.; Dong, Z.; Zhang, Z.; Wang, B.; Ke, W.; Yang, Q.; He, Z. Pbt: Progressive background-aware transformer for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5004513. [Google Scholar] [CrossRef]
  37. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  38. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  39. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  40. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  41. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  42. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  43. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  44. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]
  45. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  46. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  47. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  48. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  49. Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Zhang, X.; Zhou, X.; Zhou, E.; Sun, J. Learning delicate local representations for multi-person pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 455–472. [Google Scholar]
  50. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017.
  51. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  52. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; Volume 1, p. 4. [Google Scholar]
  53. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  54. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  55. Zeng, X.; Ouyang, W.; Yan, J.; Li, H.; Xiao, T.; Wang, K.; Liu, Y.; Zhou, Y.; Yang, B.; Wang, Z.; et al. Crafting gbd-net for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2109–2123. [Google Scholar] [CrossRef]
  56. Takikawa, T.; Acuna, D.; Jampani, V.; Fidler, S. Gated-scnn: Gated shape cnns for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5229–5238. [Google Scholar]
  57. Qian, Y.; Deng, L.; Li, T.; Wang, C.; Yang, M. Gated-residual block for semantic segmentation using RGB-D data. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11836–11844. [Google Scholar] [CrossRef]
  58. Li, X.; Zhao, H.; Han, L.; Tong, Y.; Tan, S.; Yang, K. Gated fully fusion for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 9–11 February 2020; Volume 34, pp. 11418–11425. [Google Scholar]
  59. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  60. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2016. [Google Scholar]
  61. Rahman, M.A.; Wang, Y. Optimizing intersection-over-union in deep neural networks for image segmentation. In Proceedings of the International Symposium on Visual Computing, Las Vegas, NV, USA, 12–14 December 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 234–244. [Google Scholar]
  62. Sun, F.; Luo, Z.; Li, S. Boundary difference over union loss for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 292–301. [Google Scholar]
  63. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6399–6408. [Google Scholar]
Figure 1. Comparison of large and small receptive field scales. Smaller scales tend to better perceive small targets.
Figure 2. Overview of the structure of the proposed WGFFNet.
Figure 3. Specific details of the stepped fusion block (SFB).
Figure 4. Structure of the Fully Gated Interaction (FGI) Module.
Figure 5. Comparison between Soft-IoU loss and Boundary DoU loss. The orange area represents the falsely predicted area that needs to be minimized, while the blue area represents the partial intersection area.
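As illustrative context for Figure 5, the Soft-IoU term can be computed directly from soft predictions and binary ground-truth masks. The snippet below is a minimal PyTorch-style sketch, not the authors' implementation; the Boundary DoU loss [62] additionally reweights the mismatched (boundary) region, which is omitted here.

```python
import torch

def soft_iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft-IoU loss for binary segmentation.

    pred:   sigmoid probabilities, shape (N, 1, H, W)
    target: binary ground-truth masks, same shape
    """
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) - inter
    return (1.0 - (inter + eps) / (union + eps)).mean()
```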
Figure 6. ROC curves of all methods on two datasets. (a) SIRST-V2 dataset. (b) IRSTD-1k dataset.
Figure 7. Visual results of different methods on the SIRST-V2 dataset. Targets to be detected are enlarged in the top-right corner. The blue, red, and yellow circles indicate correctly detected, falsely detected, and missed targets, respectively.
Figure 8. Visual results of different methods on the IRSTD-1k dataset. Targets to be detected are enlarged in the top-right corner. The blue, red, and yellow circles indicate correctly detected, falsely detected, and missed targets, respectively.
Figure 9. Comparison of the training effect between independent losses and the mutual loss.
Table 1. Details of locally enhanced feature encoder.

| Stage   | Downsample Factor | Block                |
|---------|-------------------|----------------------|
| Stem    |                   | [3 × 3 Conv, 32] × 3 |
| Layer-1 |                   | [SFB, 32] × 2        |
| Layer-2 |                   | [SFB, 64] × 2        |
| Layer-3 | 16×               | [SFB, 128] × 2       |
| Layer-4 | 32×               | [SFB, 256] × 2       |
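For concreteness, the PyTorch-style sketch below mirrors the stage layout of Table 1. The `SFB` class here is only a residual-style placeholder (the actual stepped fusion block with intra-scale fusion follows Figure 3), the input is assumed to be single-channel, and the downsampling strides of the stem and early layers are assumptions chosen to be consistent with the 16× and 32× factors listed above.

```python
import torch
import torch.nn as nn

class SFB(nn.Module):
    """Placeholder for the stepped fusion block: a plain residual unit with
    optional stride-2 downsampling (the real SFB adds multi-scale intra-fusion,
    omitted here)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, stride, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

def make_layer(in_ch: int, out_ch: int, blocks: int = 2) -> nn.Sequential:
    # First block halves the spatial resolution (assumed), the rest keep it.
    layers = [SFB(in_ch, out_ch, stride=2)]
    layers += [SFB(out_ch, out_ch) for _ in range(blocks - 1)]
    return nn.Sequential(*layers)

class Encoder(nn.Module):
    """Locally enhanced feature encoder skeleton following Table 1."""
    def __init__(self):
        super().__init__()
        # Stem: three 3x3 convolutions with 32 channels (stem stride is an assumption).
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU(inplace=True),
        )
        self.layer1 = make_layer(32, 32)    # [SFB, 32]  x 2
        self.layer2 = make_layer(32, 64)    # [SFB, 64]  x 2
        self.layer3 = make_layer(64, 128)   # [SFB, 128] x 2, 16x total downsampling
        self.layer4 = make_layer(128, 256)  # [SFB, 256] x 2, 32x total downsampling

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for layer in (self.layer1, self.layer2, self.layer3, self.layer4):
            x = layer(x)
            feats.append(x)
        return feats  # multi-level features passed on to the decoder
```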
Table 2. Comparison results of mIoU (%), nIoU (%), Pd (%), and Fa (×10⁻⁶) across different methods on the SIRST-V2 and IRSTD-1k datasets. Higher values indicate better performance for mIoU, nIoU, and Pd, while lower values are better for Fa. The best results are highlighted in bold.

| Method | SIRST-V2                    | IRSTD-1k                    |
|        | mIoU   nIoU   Pd     Fa     | mIoU   nIoU   Pd     Fa     |
|--------|-----------------------------|-----------------------------|
| TopHat | 14.71  30.09  89.55  320.65 | 9.17   2.77   76.72  569.36 |
| TLLCM  | 10.84  19.05  82.09  23.57  | 15.39  24.37  72.13  16.40  |
| NRAM   | 18.86  32.06  61.94  13.63  | 22.77  29.70  64.26  12.42  |
| PSTNN  | 31.80  39.03  82.84  28.71  | 21.01  30.60  68.52  98.82  |
| ALCNet | 68.13  65.20  93.28  50.29  | 61.31  56.73  90.33  14.18  |
| DNANet | 70.68  68.20  93.98  23.00  | 60.44  57.31  84.69  14.80  |
| ISNet  | 72.17  68.57  92.48  30.52  | 60.07  56.58  85.34  17.84  |
| UIUNet | 73.43  68.96  94.02  16.80  | 63.01  58.64  88.78  2.48   |
| PBT    | 72.83  70.75  92.53  27.69  | 63.21  58.39  90.10  1.70   |
| Ours   | 74.88  68.85  94.78  10.62  | 63.44  57.81  93.36  3.76   |
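The metrics in Table 2 combine pixel-level overlap (mIoU, nIoU) with target-level detection statistics (Pd, Fa). The snippet below is an illustrative implementation under common IRSTD conventions, assuming a centroid-distance matching rule; the exact matching criterion and thresholds used in the paper are not reproduced here, and nIoU (per-target normalized IoU) is omitted.

```python
import numpy as np
from scipy import ndimage

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level IoU between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def pd_fa(pred: np.ndarray, gt: np.ndarray, dist_thresh: float = 3.0):
    """Target-level probability of detection and false-alarm pixel rate.

    A ground-truth target counts as detected if some predicted component's
    centroid lies within `dist_thresh` pixels of its centroid; all pixels of
    unmatched predicted components count as false alarms.
    """
    pred_lbl, n_pred = ndimage.label(pred)
    gt_lbl, n_gt = ndimage.label(gt)
    pred_centroids = (ndimage.center_of_mass(pred, pred_lbl, range(1, n_pred + 1))
                      if n_pred > 0 else [])
    gt_centroids = (ndimage.center_of_mass(gt, gt_lbl, range(1, n_gt + 1))
                    if n_gt > 0 else [])

    matched_pred, detected = set(), 0
    for gc in gt_centroids:
        for i, pc in enumerate(pred_centroids):
            if i in matched_pred:
                continue
            if np.hypot(gc[0] - pc[0], gc[1] - pc[1]) <= dist_thresh:
                detected += 1
                matched_pred.add(i)
                break
    false_pixels = sum((pred_lbl == i + 1).sum()
                       for i in range(n_pred) if i not in matched_pred)
    pd = detected / max(n_gt, 1)
    fa = false_pixels / pred.size  # commonly reported scaled by 1e6
    return pd, fa
```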
Table 3. Computational complexity and inference time of all CNN-based methods.

| Method | Para (M) | FLOPs (G) | Inference Time (s) |
|--------|----------|-----------|--------------------|
| ALCNet | 0.37     | 0.52      | 0.27               |
| DNA    | 4.72     | 7.08      | 3.18               |
| ISNet  | 1.10     | 15.35     | 1.45               |
| UIUNet | 50.54    | 42.65     | 0.69               |
| PBT    | 26.54    | 33.48     | 4.69               |
| Ours   | 2.77     | 2.01      | 0.48               |
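Parameter counts and rough inference timings like those in Table 3 can be reproduced with a few lines of PyTorch; FLOPs are usually measured with a separate profiling tool and are not shown here. The sketch below is illustrative only: `model` stands for any torch.nn.Module (e.g., the encoder sketch above), and the 256 × 256 single-channel input resolution is an assumption.

```python
import time
import torch
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Total trainable parameters, reported in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def mean_inference_time(model: nn.Module, input_shape=(1, 1, 256, 256), runs: int = 50) -> float:
    """Average forward-pass time in seconds (CPU by default; move model and input
    to the same device before calling if measuring on GPU)."""
    model.eval()
    x = torch.randn(*input_shape)
    model(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs
```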
Table 4. Impact of different backbones on the detection performance on both datasets: mIoU (%), nIoU (%), Pd (%), and Fa (×10⁻⁶). The best result is highlighted in bold.

| Backbone             | SIRST-V2                    | IRSTD-1k                    |
|                      | mIoU   nIoU   Pd     Fa     | mIoU   nIoU   Pd     Fa     |
|----------------------|-----------------------------|-----------------------------|
| Basicblock           | 68.69  67.50  92.54  77.67  | 61.88  54.32  84.72  6.87   |
| Bottle2neck          | 70.00  64.94  90.30  51.15  | 61.48  55.57  92.69  13.96  |
| SFB w/o intra-fusion | 69.60  64.77  91.04  52.06  | 59.91  55.12  91.69  19.31  |
| SFB                  | 74.88  68.85  94.78  10.62  | 63.44  57.81  93.36  3.76   |
Table 5. Evaluation of the impact of different decoding strategies on the detection performance on both datasets: mIoU (%), nIoU (%), Pd (%), and Fa (×10⁻⁶). The best result is highlighted in bold.

| Decoding Strategy    | SIRST-V2                     | IRSTD-1k                    |
|                      | mIoU   nIoU   Pd     Fa      | mIoU   nIoU   Pd     Fa     |
|----------------------|------------------------------|-----------------------------|
| Baseline + summation | 64.48  64.14  91.80  107.22  | 59.28  54.70  94.68  35.70  |
| Concatenation        | 62.10  63.56  93.28  119.42  | 58.20  53.86  96.01  45.49  |
| Gated fusion         | 65.85  64.87  91.69  94.24   | 59.59  53.03  87.71  5.38   |
| Our Decoding         | 74.88  68.85  94.78  10.62   | 63.44  57.81  93.36  3.76   |
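The baseline strategies compared in Table 5 differ only in how an upsampled high-level feature map is merged with the corresponding low-level one. The sketch below illustrates the three variants in generic PyTorch form; it is not the paper's FGI module, which gates the interaction among all feature levels jointly, and the gate design shown here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Merge an upsampled high-level feature with a low-level feature using one
    of the baseline strategies from Table 5 (summation, concatenation, or a
    simple element-wise gate)."""
    def __init__(self, channels: int, mode: str = "gated"):
        super().__init__()
        assert mode in {"summation", "concatenation", "gated"}
        self.mode = mode
        if mode == "concatenation":
            self.reduce = nn.Conv2d(2 * channels, channels, 1)
        elif mode == "gated":
            # Gate predicted from both inputs decides how much of each to keep.
            self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Bring the high-level map to the low-level resolution before fusing.
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        if self.mode == "summation":
            return low + high
        if self.mode == "concatenation":
            return self.reduce(torch.cat([low, high], dim=1))
        g = self.gate(torch.cat([low, high], dim=1))
        return g * low + (1.0 - g) * high
```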
Table 6. Optimal detection metrics derived from network training without each independent loss function.

| Loss Function Used    | mIoU  | nIoU  | Pd    | Fa    |
|-----------------------|-------|-------|-------|-------|
| w/o Soft-IoU loss     | 68.16 | 63.92 | 88.06 | 41.63 |
| w/o Boundary DoU loss | 72.86 | 66.44 | 91.79 | 36.95 |