3.2. Backbone Network
The proposed backbone network model primarily includes convolutional layers and RepNCSPELAN4 components. Each stage block contains a Conv block with a stride of 2 and a RepNCSPELAN4 module (Figure 3a) for down-sampling and feature extraction. The most critical component for feature map extraction in the backbone is the RepNCSPELAN4 module. This design integrates the RepConv structure and draws inspiration from the Cross-Stage Partial Network (CSPNet) and ELAN [26], adopting a four-branch skip-layer connection and split operation for feature extraction. Replacing the original C2F module in YOLOv8 with this design has yielded significant improvements.
In YOLOv9 [27], the authors introduced the Generalized Efficient Layer Aggregation Network (GELAN), which integrates the CSPNet and ELAN designs to optimize gradient pathways for enhanced propagation and aggregation of feature information. The design principle of CSPNet is to partition the feature map into two segments: one undergoes convolution, while the other merges with the upper-layer convolution results through a cross-stage connection. This separation of gradient flow reduces computational complexity and enriches the information fused across branches. ELAN, by contrast, enhances gradient flow and aggregates features from multiple levels, ensuring that each layer incorporates a ResNet-style shortcut pathway. This design improves the model's receptive field and feature representation capabilities, effectively mitigating the challenges associated with increased training difficulty. GELAN significantly reduces computational load and parameter count while maintaining detection performance.
We draw inspiration from the GELAN module proposed in YOLOv9, employing the concepts of segmentation and recombination while introducing a hierarchical processing approach. This design enhances feature extraction capabilities alongside network channel expansion. Consequently, we developed the RepNCSPELAN4 feature extraction module, which leverages comprehensive gradient flow information to improve the extraction of features related to small targets for detection tasks. Specifically, the module incorporates a four-branch skip layer connection and split feature extraction operations, allowing RepNCSPELAN4 to be defined as follows:
In this module, the input with C1 channels passes through Conv1, producing an output with C3 channels. A split operation is then performed, where split1 (0.5C3) and split2 (0.5C3) denote the first two segments of the Conv1 output, divided along the channel dimension, with no further operations applied before these segments are output. The outputs of Conv2 (C4 channels) and Conv3 (C4 channels) result from independent convolution and pooling operations applied to their respective branches. Finally, the four segments are concatenated along the channel dimension to produce the final output features. The specific process is delineated as follows:
First, in the initial two branches, namely split1 and split2, the absence of any operational transformation leads to a direct output equivalent to 2 × 0.5C3. In the subsequent feature extraction modules, the third branch is processed by the Re-parameterization-Net with Cross-Stage Partial (RepNCSP1) submodule (Figure 3b). RepNCSP1 divides the input features into two segments: one segment undergoes feature extraction via the Re-parameterization Bottleneck without Identity Connection (RepNBottleneck), while the other employs conventional convolution operations. These two segments are then concatenated to augment the network's feature extraction capability and its overall efficacy in capturing relevant information. The architecture of the RepNBottleneck submodule (Figure 3c) references the ResNet structure, where one branch maintains the output channel count at C4/2, while the other branch is processed through the Re-parameterization Convolution without Identity Connection (RepConvN) submodule (Figure 4). The core idea of RepConvN is to utilize two distinct convolution kernel sizes, 3 × 3 and 1 × 1, for feature extraction. During the inference phase, structural re-parameterization merges the 1 × 1 and 3 × 3 convolution kernels into a single 3 × 3 kernel: the 1 × 1 kernel is zero-padded to the size of the 3 × 3 kernel and, because kernels of the same size are additive, the padded kernel can be added to the original 3 × 3 kernel, forming a single 3 × 3 convolution kernel for inference. Applying RepConvN within the RepNBottleneck submodule enhances the model's efficiency and performance. After the nested RepNBottleneck feature extraction is completed, the process returns to the RepNCSP1 submodule, where N RepNBottleneck operations are executed sequentially. The output of this submodule is then concatenated with the original convolution channel and, after passing through Conv2, yields an output feature size of C4. Subsequently, the output features of the third branch serve as input for the fourth branch, which enters the RepNCSP2 submodule and repeats the operations above. Ultimately, the output features of the RepNCSP2 submodule, after passing through Conv3, also have a size of C4.
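Returning to the re-parameterization step described above, the short sketch below shows how a 1 × 1 kernel can be zero-padded and added to a 3 × 3 kernel, and verifies that the fused kernel reproduces the two-branch output. It is a minimal illustration only: bias terms and batch-normalization folding, which RepConv-style blocks also merge in practice, are omitted.

```python
import torch
import torch.nn.functional as F

def merge_1x1_into_3x3(k3x3: torch.Tensor, k1x1: torch.Tensor) -> torch.Tensor:
    """Fuse a 1x1 branch into a 3x3 branch for inference.

    k3x3: weights of shape (out_ch, in_ch, 3, 3)
    k1x1: weights of shape (out_ch, in_ch, 1, 1)
    Returns a single (out_ch, in_ch, 3, 3) kernel equivalent to summing the
    two branch outputs (bias and BN folding omitted for brevity).
    """
    # Zero-pad the 1x1 kernel to 3x3 so the two kernels can be added directly.
    k1x1_padded = F.pad(k1x1, [1, 1, 1, 1])  # pad the last two dims by 1 on each side
    return k3x3 + k1x1_padded

# Minimal check that the fused kernel reproduces the two-branch output.
x = torch.randn(1, 8, 32, 32)
k3 = torch.randn(16, 8, 3, 3)
k1 = torch.randn(16, 8, 1, 1)
two_branch = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1, padding=0)
fused = F.conv2d(x, merge_1x1_into_3x3(k3, k1), padding=1)
print(torch.allclose(two_branch, fused, atol=1e-4))  # True
```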
The final output consists of four feature segments: split1 and split2, each carrying 0.5C3 channels, represent the first two outputs. The third output, with C4 channels, is obtained by processing through the RepNCSP1 submodule and the Conv2 block, while the fourth path, through the RepNCSP2 submodule and the Conv3 block, likewise yields an output feature with C4 channels. Ultimately, these four segments undergo a Concat operation, yielding a final output feature with 0.5C3 + 0.5C3 + C4 + C4 = C3 + 2 × C4 channels. The innovation of this module lies in its capability to amplify channel dimensions while effectively learning multi-scale small-object features and expanding the receptive field through intra-module feature vector splitting and multi-level nested convolutions. This approach not only enhances the network's efficiency but also addresses the excessive parameter count of the original C2F module in YOLOv8, which arises from fusing features from different hierarchical levels. Consequently, the proposed module achieves a comprehensive improvement characterized by a reduced parameter count, higher detection accuracy, and better training generalization, offering a more streamlined and efficient alternative to conventional designs and thereby contributing to improved object detection performance.
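A minimal structural sketch of this four-branch split/concat flow is given below. It is illustrative only: the RepNCSP and Conv submodules are collapsed into single convolutions, and the channel widths are example values rather than the configuration used in AID-YOLO.

```python
import torch
import torch.nn as nn

class RepNCSPELAN4Sketch(nn.Module):
    """Structural sketch of the four-branch split/concat flow (not the full module)."""

    def __init__(self, c1: int, c3: int, c4: int):
        super().__init__()
        self.conv1 = nn.Conv2d(c1, c3, 1)                     # C1 -> C3
        # Stand-ins for (RepNCSP1 + Conv2) and (RepNCSP2 + Conv3).
        self.branch3 = nn.Conv2d(c3 // 2, c4, 3, padding=1)   # 0.5*C3 -> C4
        self.branch4 = nn.Conv2d(c4, c4, 3, padding=1)        # C4 -> C4

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv1(x)
        split1, split2 = y.chunk(2, dim=1)  # two 0.5*C3 identity branches
        y3 = self.branch3(split2)           # third branch
        y4 = self.branch4(y3)               # fourth branch, fed by the third
        # Output channels: 0.5*C3 + 0.5*C3 + C4 + C4 = C3 + 2*C4
        return torch.cat([split1, split2, y3, y4], dim=1)

x = torch.randn(1, 64, 80, 80)
m = RepNCSPELAN4Sketch(c1=64, c3=64, c4=32)
print(m(x).shape)  # torch.Size([1, 128, 80, 80]) -> 64 + 2*32 channels
```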
3.3. Neck Network
In YOLOv8, the neck network retains the PAFPN structure, which fuses feature maps of different scales and provides richer feature representations. Here, the C2f module fuses low-resolution and high-resolution feature maps simultaneously to enhance detection accuracy. In recent years, attention mechanisms have been widely introduced into object detection architectures to optimize models and have achieved significant results. Through training, a deep neural network can learn which regions of each new image require particular attention. Self-attention, spatial attention, temporal attention, and branch attention are the most typical attention mechanisms. Adding attention mechanisms to the neck therefore not only allows the network to focus more on target regions and model them more finely, but also helps it attend to edge, texture, and other fine details of small objects in the image data, thereby improving the overall recall and precision of object detection.
The Convolutional Block Attention Module (CBAM) is an attention mechanism used in computer vision tasks and is particularly well suited to Convolutional Neural Networks (CNNs). CBAM fuses channel and spatial attention, considering the significance of features in both the channel and spatial dimensions, and refines the input feature map in two stages. Used jointly, the two attention mechanisms better capture the key features of small targets in images. The channel attention mechanism first addresses "which channels are important": parallel average pooling and max pooling aggregate the spatial information of the input feature map into two descriptors, which are fed into a shared multilayer perceptron; the two outputs are summed element-wise and passed through a sigmoid activation to generate the channel attention map. The spatial attention mechanism then addresses "where is important": global max pooling and global average pooling are applied in parallel along the channel dimension to obtain a pair of single-channel maps, which are concatenated along the channel axis and passed through a convolution layer to produce a single-channel map; a sigmoid operation then generates the spatial attention features. This mechanism adaptively strengthens the model's focus on critical features and enhances its recognition ability. More importantly, CBAM is a lightweight, general-purpose module that improves training effectiveness while adding negligible computational burden to the network, making it simple and efficient.
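For reference, a compact sketch of the two CBAM stages as described above is shown below. The reduction ratio (16) and spatial kernel size (7) follow common CBAM defaults and are assumptions, not values taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAMSketch(nn.Module):
    """Channel attention followed by spatial attention (standard CBAM layout)."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: shared MLP applied to avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 2-channel (avg, max) map -> single-channel attention map.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Which channels are important"
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # "Where is important"
        avg_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
```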
Therefore, this paper also designs a multi-spatial channel-enhanced C2FCBAM module in AID-YOLO, as shown in Figure 5, aiming to improve the detection of small objects. As shown in Figure 4, the neck's C2f module plays a crucial role in the CSP Bottleneck structure: through feature transformation, branch processing, and feature fusion, it extracts and transforms the features of the input data and generates more representative outputs, enhancing the performance and representational capability of the network and thereby its adaptability to complex data and tasks. Hence, building on the CBAM attention mechanism, which assigns convolutional attention weights in both the spatial and channel dimensions, we cascade and enhance two CBAM modules (MSCE-CBAM) and combine them with the Bottleneck module in the C2F part of the neck network. The Bottleneck module still comprises two convolutional blocks: the input first passes through the initial convolutional layer, and the second convolutional layer is replaced with the nested MSCE-CBAM. This internally and externally nested double residual information linkage further enhances the model's ability to focus on crucial attributes of the detection object, improving detection performance.
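One possible reading of this Bottleneck modification is sketched below purely for illustration (the authoritative layout is defined by Figure 5, not by this code): the second convolution of the Bottleneck is replaced by two cascaded CBAM blocks while the residual connection is kept. It reuses the CBAMSketch class from the previous sketch.

```python
import torch
import torch.nn as nn

class BottleneckMSCECBAMSketch(nn.Module):
    """Illustrative Bottleneck variant: Conv followed by two cascaded CBAM blocks."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )
        # Second convolution replaced by cascaded attention (CBAMSketch defined above).
        self.attn = nn.Sequential(CBAMSketch(channels), CBAMSketch(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual linkage around the Conv + cascaded-attention path.
        return x + self.attn(self.conv1(x))
```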
3.4. Head Network
The detection samples in this dataset primarily consist of small targets. Tiny objects generally have dimensions smaller than 16 × 16 pixels and therefore provide extremely limited visual information, which makes it difficult for the network to learn discriminative features for such targets and leads to a higher rate of missed detections. Mainstream detectors currently adopt BCE (Binary Cross-Entropy) as the classification loss and IoU (Intersection over Union) as the regression loss. Despite many modifications, IoU's sensitivity to objects of different scales varies greatly, as shown in Figure 6. For small objects in the recognition dataset, minor positional deviations cause significant IoU drops, resulting in inaccurate positive and negative sample label assignments, whereas for larger objects the IoU shows only minimal variation, indicating that the IoU metric responds very unevenly to the same positional deviation across objects of diverse scales. Therefore, using the IoU series as the loss function of a small object detection model can lead to insufficient feedback of feature information for small targets, causing the model to focus only on larger targets while neglecting small-target feature learning and making it difficult for the model to converge. Although the CIoU loss function combines the characteristics of the Generalized Intersection over Union (GIoU) and IoU loss functions and considers the area and center distance of the bounding box, it is designed based on the object's area, so the influence of larger objects becomes more significant, potentially leading to excessive correction for smaller objects. Thus, it is crucial to use different loss weights for recognition targets of varying sizes to alleviate this issue.
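As a simple numeric illustration of this scale sensitivity (the boxes and the 4-pixel shift below are made-up values, not taken from the dataset), the same positional deviation collapses the IoU of a 16 × 16 box far more than that of a 160 × 160 box:

```python
def iou(box_a, box_b):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

shift = 4  # identical 4-pixel horizontal deviation for both boxes
print(iou((0, 0, 16, 16), (shift, 0, 16 + shift, 16)))      # ~0.60
print(iou((0, 0, 160, 160), (shift, 0, 160 + shift, 160)))  # ~0.95
```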
To solve this problem, we propose a new weighted bounding box regression loss, NWD-CIoU_Loss, which integrates the NWD loss, specifically designed for small targets, with the CIoU loss. A weight distribution coefficient adjusts the relative significance of the two losses, enhancing the detection performance for minuscule objects. The NWD loss is computed in the following steps:
First, to better describe the weight of different pixels within the bounding box, the bounding box is modeled as a two-dimensional Gaussian distribution. The calculation process is as follows. A horizontal bounding box is denoted R = (cx, cy, w, h), where (cx, cy), w, and h represent the center coordinates, width, and height, respectively. Its inscribed ellipse equation is

\[ \frac{(x-\mu_x)^2}{\delta_x^2} + \frac{(y-\mu_y)^2}{\delta_y^2} = 1 \]

In the equation, (μx, μy) are the center coordinates of the ellipse, and δx and δy are the semi-axis lengths along the x and y axes. Thus, μx = cx, μy = cy, δx = w/2, and δy = h/2. The probability density function of the two-dimensional Gaussian distribution is as follows:

\[ f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{2\pi\,|\Sigma|^{\frac{1}{2}}} \]
where (x, y) represents the coordinates of the Gaussian distribution, μ is the mean vector, and Σ is the covariance matrix.
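Written out explicitly (this simply restates the substitutions μx = cx, μy = cy, δx = w/2, and δy = h/2 above), the box R is therefore modeled by the Gaussian distribution N(μ, Σ) with

\[ \boldsymbol{\mu} = \begin{bmatrix} cx \\ cy \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix} \]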
Next, the distribution distance is calculated using the Wasserstein distance from optimal transport theory. For two-dimensional Gaussian distributions µ1 = N(m1, Σ1) and µ2 = N(m2, Σ2), the second-order Wasserstein distance between µ1 and µ2 is defined as follows:

\[ W_2^2(\mu_1, \mu_2) = \left\lVert m_1 - m_2 \right\rVert_2^2 + \left\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \right\rVert_F^2 \]

where ‖·‖F denotes the Frobenius norm.
For bounding box modeling, with bounding boxes A = (cxa, cya, wa, ha) and B = (cxb, cyb, wb, hb) and their corresponding Gaussian distributions Na and Nb, the second-order Wasserstein distance simplifies to

\[ W_2^2(N_a, N_b) = \left\lVert \left( \left[ cx_a,\; cy_a,\; \frac{w_a}{2},\; \frac{h_a}{2} \right]^{\top},\; \left[ cx_b,\; cy_b,\; \frac{w_b}{2},\; \frac{h_b}{2} \right]^{\top} \right) \right\rVert_2^2 \]
Since W2(Na, Nb) is a distance metric, whereas the original IoU is a similarity metric for bounding boxes, a new metric called the normalized Wasserstein distance (NWD) is obtained through exponential normalization:

\[ NWD(N_a, N_b) = \exp\!\left( -\frac{\sqrt{W_2^2(N_a, N_b)}}{C} \right) \]
Here, C represents a constant closely tied to the dataset, conventionally set to 12.8.
Finally, since IoU_Loss cannot provide gradients for optimizing the network when there is no overlap between the predicted bounding box P and the ground truth G (i.e., |P ∩ G| = 0) or when P and G are mutually inclusive (i.e., |P ∩ G| = |P| or |G|), the NWD metric is adopted as a loss function to better handle small object detection, as follows:

\[ L_{NWD} = 1 - NWD(N_p, N_g) \]
where Np is the Gaussian distribution model of the predicted box P, and Ng is the Gaussian distribution model of the ground truth box G. In summary, considering the inconsistent distribution of targets of different scales in ground-based scenes, the ratio of the NWD to IoU metrics is set to α:β to achieve better detection of targets of diverse scales. The final bounding box regression loss function is as follows:

\[ L_{NWD\text{-}CIoU} = \alpha\, L_{NWD} + \beta\, L_{CIoU} \]
where α + β = 1 constrains the adjustable weights of the bounding box loss. Considering that small objects are more sensitive to displacement of the center point, α is set to be greater than β in the experiments.
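A minimal sketch of how this weighted regression loss could be assembled is given below. It assumes boxes in (cx, cy, w, h) format, uses torchvision's complete_box_iou_loss (available in torchvision ≥ 0.13) for the CIoU term, and the weight α = 0.6 is an illustrative placeholder rather than the value used in the experiments.

```python
import torch
from torchvision.ops import complete_box_iou_loss  # assumed available (torchvision >= 0.13)

def nwd_loss(pred: torch.Tensor, gt: torch.Tensor, c: float = 12.8) -> torch.Tensor:
    """1 - NWD for boxes given as (cx, cy, w, h); c is the dataset-related constant."""
    # Each box maps to N([cx, cy], diag(w^2/4, h^2/4)); W2^2 then reduces to the
    # squared L2 distance between the vectors [cx, cy, w/2, h/2].
    p = torch.cat([pred[:, :2], pred[:, 2:] / 2], dim=1)
    g = torch.cat([gt[:, :2], gt[:, 2:] / 2], dim=1)
    w2 = ((p - g) ** 2).sum(dim=1)
    return 1.0 - torch.exp(-torch.sqrt(w2 + 1e-9) / c)

def cxcywh_to_xyxy(b: torch.Tensor) -> torch.Tensor:
    cx, cy, w, h = b.unbind(dim=1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

def nwd_ciou_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 0.6) -> torch.Tensor:
    """Weighted regression loss: alpha * L_NWD + beta * L_CIoU, with alpha + beta = 1."""
    beta = 1.0 - alpha  # alpha > beta, per the weighting described above
    l_ciou = complete_box_iou_loss(cxcywh_to_xyxy(pred), cxcywh_to_xyxy(gt), reduction="none")
    return (alpha * nwd_loss(pred, gt) + beta * l_ciou).mean()
```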