DyHead is a target detection head framework based on an attention mechanism. It adaptively adjusts the receptive field size to accommodate scale changes across targets, improving the detection of targets at different scales. Meanwhile, DyHead uses a shared feature extraction network to reduce computation and the number of parameters.
The coupled head used in YOLOv5 follows a simple design, but it requires a large number of parameters and considerable computational resources and is prone to overfitting. A decoupled head instead extracts the target location and category information separately, learns them through different network branches, and finally fuses them. This effectively reduces the number of parameters and the computational complexity while enhancing the generalization ability and robustness of the model.
2.1.1. Designing a New Loss Function NMIoU
Small target detection is a very challenging problem in pest detection, where pests may occupy only a few pixels in an image. Because the appearance information of small target pests is difficult to capture, current state-of-the-art detectors struggle to achieve good results on small targets. Our experiments also show that loss functions such as IoU [27], CIoU [28], and DIoU are very sensitive to the positional deviation of small targets and greatly reduce detection performance when used in anchor-based detectors. To solve this problem, we propose NMIoU, a loss function that combines the NWD metric with the MPDIoU loss function; the MPDIoU loss function is illustrated in Figure 2.
The loss function used in YOLOv5 is CIoU, which takes into account the distance between the target box and the anchor, the overlap, the scale, and a penalty term, making target box regression more stable. However, CIoU suffers from high computational complexity, is highly sensitive to the target box, does not handle large scale differences well, and fails to account for the case where the ground-truth box and the predicted box share the same aspect ratio but differ in size.
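A short worked example makes this concrete. CIoU penalizes aspect-ratio inconsistency through the term

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2.$$

For a predicted box of 10 × 5 and a ground-truth box of 20 × 10 sharing the same center, the aspect ratios are equal, so $v = 0$, and the center-distance term also vanishes; the CIoU penalty terms then provide no signal about the mismatched widths and heights.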
The CIoU loss thus cannot be optimized when the predicted box has the same aspect ratio as the ground-truth box but completely different width and height values. MPDIoU solves this problem well: it is a bounding box similarity measure based on minimum point distance, which accounts for all the relevant factors considered in existing loss functions, such as the overlapping or non-overlapping area, the distance between centroids, and the deviation in width and height. MPDIoU is determined by the four vertices of the bounding box: the upper-left corner $(x_1, y_1)$, upper-right corner $(x_2, y_2)$, lower-left corner $(x_3, y_3)$, and lower-right corner $(x_4, y_4)$. A general bounding box is determined by its center coordinates $(c_x, c_y)$, width $w$, and height $h$, so the four vertex coordinates can be calculated as

$$(x_1, y_1) = \left(c_x - \tfrac{w}{2},\, c_y - \tfrac{h}{2}\right), \quad (x_2, y_2) = \left(c_x + \tfrac{w}{2},\, c_y - \tfrac{h}{2}\right), \quad (x_3, y_3) = \left(c_x - \tfrac{w}{2},\, c_y + \tfrac{h}{2}\right), \quad (x_4, y_4) = \left(c_x + \tfrac{w}{2},\, c_y + \tfrac{h}{2}\right).$$

After the four vertex coordinates are obtained [29], the distances between the corresponding vertex pairs of the predicted and ground-truth boxes are calculated, and the minimum point distances $d_1$ (between the two upper-left vertices) and $d_2$ (between the two lower-right vertices) are used. Then, the ratio of the overlapping region of the predicted and ground-truth boxes to their union is calculated. The final MPDIoU loss function can be defined as

$$\mathrm{MPDIoU} = \frac{|A \cap B|}{|A \cup B|} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}, \qquad \mathcal{L}_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU}.$$

In this equation, $(x_1, y_1), \ldots, (x_4, y_4)$ are the coordinates of the four vertices of the bounding box, $(c_x, c_y)$ are the coordinates of its center, $w$ is the width of the bounding box, $h$ is its height, and $A$ and $B$ denote the predicted and ground-truth boxes.
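To make the computation concrete, the following is a minimal PyTorch sketch of the MPDIoU loss for boxes in $(x_1, y_1, x_4, y_4)$ corner format. The function name and the `norm_w`/`norm_h` parameters (the width and height used to normalize the point distances; the input image dimensions in the original MPDIoU formulation) are illustrative assumptions, not the implementation used in this paper.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                norm_w: float, norm_h: float) -> torch.Tensor:
    """Per-box MPDIoU loss sketch; pred/target are [N, 4] corner-format boxes."""
    # Intersection area of the predicted and ground-truth boxes.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Union area.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter).clamp(min=1e-7)
    # Squared distances between the upper-left and lower-right vertex pairs.
    d1 = ((pred[:, :2] - target[:, :2]) ** 2).sum(dim=1)
    d2 = ((pred[:, 2:] - target[:, 2:]) ** 2).sum(dim=1)
    norm = norm_w ** 2 + norm_h ** 2
    mpdiou = iou - d1 / norm - d2 / norm
    return 1.0 - mpdiou
```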
The NWD metric is introduced to address the high sensitivity to target boxes and the inapplicability to large scale differences. Traditional IoU and its variants treat all pixels within a bounding box equally, yet in real bounding boxes the foreground pixels are concentrated near the center and the background pixels near the boundary, so the box carries less useful information toward its edges, which hurts low-confidence boxes. NWD solves this problem well by assigning different weights to the pixels in the bounding box, with the center having the highest weight and the weights decreasing from the center to the boundary. To this end, each bounding box is modeled as a two-dimensional Gaussian distribution, and the distance between the distributions is computed using the Wasserstein distance. The two-dimensional Gaussian distribution of a bounding box is defined by its centroid $(c_x, c_y)$, width $w$, and height $h$:

$$\mathcal{N}(\mu, \Sigma), \qquad \mu = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}.$$

For the Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ of two bounding boxes $A$ and $B$, the Wasserstein distance is obtained from the difference between their means and covariances:

$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\| \left[ c_{x_a},\, c_{y_a},\, \tfrac{w_a}{2},\, \tfrac{h_a}{2} \right]^{\mathrm{T}} - \left[ c_{x_b},\, c_{y_b},\, \tfrac{w_b}{2},\, \tfrac{h_b}{2} \right]^{\mathrm{T}} \right\|_2^2.$$

Finally, the value of the Wasserstein distance is restricted to the range between 0 and 1 by normalization,

$$\mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b) = \exp\!\left( -\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right),$$

where the normalization factor $C$ is the larger of the diagonal lengths of the two bounding boxes.
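As a sketch of this metric, assuming boxes in $(c_x, c_y, w, h)$ center format and a scalar normalization constant `c`, the NWD computation reduces to a few lines of PyTorch (the function name is hypothetical):

```python
import torch

def nwd(pred: torch.Tensor, target: torch.Tensor, c: float) -> torch.Tensor:
    """NWD sketch for [N, 4] boxes in (cx, cy, w, h) format."""
    # Modeling each box as a Gaussian N([cx, cy], diag(w^2/4, h^2/4)) makes the
    # squared 2-Wasserstein distance a plain L2 distance between the vectors
    # [cx, cy, w/2, h/2] of the two boxes.
    p = torch.cat([pred[:, :2], pred[:, 2:] / 2], dim=1)
    t = torch.cat([target[:, :2], target[:, 2:] / 2], dim=1)
    w2 = ((p - t) ** 2).sum(dim=1)
    # Exponential normalization maps the distance into (0, 1].
    return torch.exp(-torch.sqrt(w2) / c)
```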
In addition, NWD is integrated into the detector by improving the IoU-based label assignment, NMS, and regression loss. In the NWD-based label assignment strategy for training RPNs, positive labels are assigned to two types of anchors: (i) the anchor with the highest NWD value for a ground-truth box, provided this value is greater than $\theta_n$, and (ii) anchors whose NWD value with any ground-truth box is higher than the positive threshold $\theta_p$. Accordingly, negative labels are assigned to anchors whose NWD values are lower than the negative threshold $\theta_n$ for all ground-truth boxes; anchors assigned neither positive nor negative labels do not participate in the training process. Here, $\theta_p$ and $\theta_n$ follow the thresholds of the original detectors. NMS is likewise based on NWD: all prediction boxes are first sorted by their scores, the highest-scoring box $M$ is selected, and all other boxes that overlap significantly with $M$ (as measured by NWD) are suppressed; this process is applied recursively to the remaining boxes. For the regression loss, IoU-Loss cannot provide gradients for optimizing the network in some cases (for example, when the predicted box and the ground-truth box do not overlap at all), so the NWD metric is designed as a loss function:

$$\mathcal{L}_{\mathrm{NWD}} = 1 - \mathrm{NWD}(\mathcal{N}_p, \mathcal{N}_g),$$

where $\mathcal{N}_p$ and $\mathcal{N}_g$ are the Gaussian models of the predicted box and the ground-truth box, respectively.
As mentioned above, the NMIoU loss function is obtained by combining MPDIoU and NWD; a reasonable balance coefficient then regulates the loss weights of the MPDIoU and NWD terms. The formula for the NMIoU loss function is as follows:

$$\mathcal{L}_{\mathrm{NMIoU}} = \lambda\,\mathcal{L}_{\mathrm{NWD}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{MPDIoU}}.$$

In this equation, λ is the equilibrium coefficient that regulates the loss weights of MPDIoU and NWD.
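Putting the two parts together, here is a minimal sketch of the combined loss, assuming corner-format inputs, the helper functions from the sketches above, and the weighting as written in the equation (the exact split of λ between the two terms is our reading of the formula, not confirmed by the paper):

```python
import torch

def nmiou_loss(pred: torch.Tensor, target: torch.Tensor,
               norm_w: float, norm_h: float, c: float,
               lam: float = 0.5) -> torch.Tensor:
    """NMIoU sketch: lambda-weighted mix of the NWD and MPDIoU losses."""
    # The NWD term expects (cx, cy, w, h) boxes, so convert the corners first.
    def to_cxcywh(b: torch.Tensor) -> torch.Tensor:
        return torch.cat([(b[:, :2] + b[:, 2:]) / 2, b[:, 2:] - b[:, :2]], dim=1)

    l_nwd = 1.0 - nwd(to_cxcywh(pred), to_cxcywh(target), c)
    l_mpdiou = mpdiou_loss(pred, target, norm_w, norm_h)
    return lam * l_nwd + (1.0 - lam) * l_mpdiou
```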
2.1.2. Adding a DyHead with an Attention Mechanism
The detection head is a crucial component of a target detection model; it processes the features of the last layer of the network and generates the detection results. In the target detection task, the detection head assumes several important roles. First, it is responsible for target localization, predicting the location of the target box through regression. Second, it is responsible for target classification: by predicting the category of each target with a classifier, the model can identify and label different objects in the image. In addition, the detection head generates the target score, which measures the confidence or importance of each target in the detection results; this helps filter out high-confidence targets, improving the accuracy and reliability of the results. Through target localization, target classification, and target score generation, the detection head achieves accurate localization and classification of targets in an image, providing the detection results needed by real-world application scenarios.
For this reason, this paper introduces the DyHead detection head, which contains an attention mechanism, into YOLOv5. DyHead consists of a scale-aware attention module, a spatial-aware attention module, and a task-aware attention module. When the scale-aware attention module processes the original feature map F1, it first performs a global average pooling operation, aggregating features at different scales to obtain a multi-scale global representation. The multi-scale feature maps are then integrated with a 1 × 1 convolution to fuse features across scales, and ReLU and hard sigmoid functions are used to enhance the nonlinear representation of the multi-scale feature maps. Finally, the computed weights are multiplied with the original feature map to obtain the new feature map F2. When the spatial-aware attention module processes F2, an index function extracts the positional information of the feature map; the extracted positional information is then processed by a 3 × 3 deformable convolution and used to adjust the offsets of the convolution kernel, with sparse sampling and aggregation of the feature elements through a sigmoid function. The offset function uses the obtained offsets to increase the weight of the target's shallow contour and edge information in the network. Finally, the acquired positional information is appended to produce a new feature map F3. F3 is fed into the task-aware attention module, which reduces the spatial dimensionality of the feature map through average pooling, followed by two fully connected layers that learn the inter-channel relationships. After the fully connected layers, the ReLU activation function is introduced to enhance the nonlinear representation of the features, and the scaling of the features is adjusted by a normalization operation; the normalization uses the scaling and offset factors of Batch Normalization for task awareness. The output of the task-aware attention module is then obtained through a composite operation to yield the desired feature map F4, as shown in Figure 3.
DyHead uses the attention mechanism to unify the different target detection heads. As can be seen in Figure 4, the initial features are noisy due to domain differences and therefore cannot focus well on the target. First, after the original features are processed by the scale-aware attention module, the features become more sensitive to targets at different scales. Second, after the spatial-aware attention module processes these feature maps, the features become sparser and focus on the foreground targets at different locations. Finally, after processing by the task-aware attention module, the features form different activations according to the downstream task. DyHead thus applies attention between feature levels for scale awareness, between spatial locations for spatial awareness, and within the output channels for task awareness, as shown in Figure 4.
The three attention modules are applied sequentially to the feature tensor $F$:

$$W(F) = \pi_C\big(\pi_S\big(\pi_L(F) \cdot F\big) \cdot F\big) \cdot F,$$

with the scale-aware, spatial-aware, and task-aware attentions defined as

$$\pi_L(F) \cdot F = \sigma\!\left(f\!\left(\frac{1}{SC}\sum_{S,C} F\right)\right) \cdot F,$$

$$\pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k}\, F\big(l;\, p_k + \Delta p_k;\, c\big)\, \Delta m_k,$$

$$\pi_C(F) \cdot F = \max\big(\alpha^1(F)\, F_c + \beta^1(F),\ \alpha^2(F)\, F_c + \beta^2(F)\big).$$

In these equations, $L$ is the number of layers, $f$ is a linear function approximated by a 1 × 1 convolution, $\sigma$ is the hard sigmoid activation function, $K$ is the number of sparsely sampled positions, $w_{l,k}$ is the weight of the convolutional kernel, $p_k$ is a position and $\Delta p_k$ is its spatial offset, $\Delta m_k$ is a self-learnable importance factor with respect to the position $p_k$, $F_c$ is the slice of the feature tensor on a particular channel $c$, $\alpha^1$ and $\alpha^2$ are weights obtained by network learning, and $\beta^1$ and $\beta^2$ are bias terms.
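To make the three attentions concrete, here is a simplified single-level PyTorch sketch of one DyHead block. The module structure, the channel reduction ratio, and the one-sided task-aware activation (a simplification of the two-branch max above) are our assumptions for illustration, not the exact DyHead implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DyHeadBlockSketch(nn.Module):
    """Simplified one-level DyHead block: scale-, spatial-, and task-aware attention."""

    def __init__(self, channels: int):
        super().__init__()
        # pi_L: global pooling -> 1x1 conv (the linear f) -> hard sigmoid.
        self.scale_fc = nn.Conv2d(channels, 1, kernel_size=1)
        # pi_S: 3x3 deformable convolution with learned offsets and masks.
        self.offset_conv = nn.Conv2d(channels, 2 * 9, kernel_size=3, padding=1)
        self.mask_conv = nn.Conv2d(channels, 9, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        # pi_C: per-channel scale/shift predicted from pooled features.
        self.task_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2 * channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale-aware attention: reweight the map by a pooled, squashed scalar.
        x = x * F.hardsigmoid(self.scale_fc(F.adaptive_avg_pool2d(x, 1)))
        # Spatial-aware attention: sparse sampling via deformable convolution,
        # with sigmoid masks acting as the learned importance factors.
        offset = self.offset_conv(x)
        mask = torch.sigmoid(self.mask_conv(x))
        x = self.deform(x, offset, mask)
        # Task-aware attention (simplified): dynamic per-channel activation.
        a, b = self.task_fc(x).chunk(2, dim=1)
        return torch.maximum(a * x + b, torch.zeros_like(x))
```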
DyHead is introduced into YOLOv5 as applied to a one-stage detector: the scale-aware, spatial-aware, and task-aware attention modules are combined into one block, and the number of times this block is repeated is then selected for image processing, as shown in Figure 5.
Normalization in DyHead uses the configuration of Group Normalization [30] (GN). GN aims to address the performance degradation of Batch Normalization [31] (BN) on small batches of data or when the batch size varies. The core idea of GN is to divide the channels in each feature map of the network into groups and normalize the channels within each group, independently of the batch dimension. Doing so reduces the model's dependence on the batch size while maintaining the efficiency and effectiveness of the normalization operation. Specifically, the channels of the feature map are divided into G groups, each containing C/G channels. If the size of the feature map is [N, H, W, C], where N is the batch size, H and W are the height and width of the feature map, and C is the number of channels, then each group contains C/G channels; the mean and variance are calculated over the channels within each group, and the channels in each group are normalized using these statistics. The normalization formula is as follows:

$$\mu_g = \frac{1}{m}\sum_{i \in \mathcal{S}_g} x_i, \qquad \sigma_g^2 = \frac{1}{m}\sum_{i \in \mathcal{S}_g} (x_i - \mu_g)^2, \qquad \hat{x}_i = \frac{x_i - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta.$$
In this equation, $x_i$ is an input feature, $\mathcal{S}_g$ is the set of $m$ features belonging to group $g$, $\mu_g$ is the mean of the features in group $g$, $\sigma_g^2$ is their variance, and $\epsilon$ is a very small constant that prevents division by zero and ensures numerical stability. $\gamma$ and $\beta$ are the learnable scaling and offsetting parameters.
The final normalized feature map is fed into the next layer of the network. Because the mean and variance are computed per sample over each group of channels rather than over the whole batch, the distribution of inputs to each layer remains stable regardless of batch size, reducing internal covariate shift and thus allowing larger learning rates, faster training, and less sensitivity to initialization weights.
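For reference, a short PyTorch check of the normalization above using `nn.GroupNorm`; the tensor sizes are arbitrary examples, and PyTorch uses the channel-first [N, C, H, W] layout rather than [N, H, W, C].

```python
import torch
import torch.nn as nn

# Feature map with C = 256 channels split into G = 32 groups of C/G = 8 channels.
x = torch.randn(4, 256, 20, 20)
gn = nn.GroupNorm(num_groups=32, num_channels=256)
y = gn(x)

# Manual computation matching the formula above.
g = x.reshape(4, 32, 8, 20, 20)                       # split channels into groups
mu = g.mean(dim=(2, 3, 4), keepdim=True)              # per-sample, per-group mean
var = g.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
x_hat = ((g - mu) / torch.sqrt(var + gn.eps)).reshape_as(x)
y_manual = x_hat * gn.weight.view(1, -1, 1, 1) + gn.bias.view(1, -1, 1, 1)
assert torch.allclose(y, y_manual, atol=1e-5)
```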
2.1.3. Adding Decoupled Head to Head
There is no decoupled head in YOLOv5, only a coupled head. The coupled head typically feeds the feature map output from the convolutional layers directly into several fully connected or convolutional layers to generate outputs for the target locations and categories. Such a design burdens the model with many parameters and high computational cost, and the model is also prone to overfitting. In contrast, YOLOv6 designs a more efficient decoupled head structure with the help of a hybrid-channel strategy, which reduces the number of 3 × 3 convolutional layers in the decoupled head of YOLOX to only one. The width of the head is jointly scaled by the width multipliers of the Backbone and Neck, which reduces latency while maintaining accuracy, mitigating the additional latency overhead of the 3 × 3 convolutions in the decoupled head. This is shown in Figure 6.
In this paper, drawing on YOLOv6's decoupled head idea, a decoupled head is introduced into YOLOv5. The feature maps of P3, P4, and P5 in YOLOv5 are taken as input; each is processed by a 1 × 1 convolution and divided into two branches, each branch is processed by a 3 × 3 convolution, and final 1 × 1 convolutions output a classification map, a coordinate position map, and a target box confidence map, respectively.
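A minimal PyTorch sketch of this structure for a single feature level is given below; the channel widths, the `num_anchors` parameter, and the class name are illustrative assumptions rather than the exact YOLOv5/YOLOv6 code.

```python
import torch
import torch.nn as nn

class DecoupledHeadSketch(nn.Module):
    """YOLOv6-style efficient decoupled head sketch for one FPN level (P3/P4/P5)."""

    def __init__(self, in_channels: int, num_classes: int, num_anchors: int = 1):
        super().__init__()
        # Shared 1x1 stem, then separate classification / regression branches,
        # each with a single 3x3 conv (the hybrid-channel simplification).
        self.stem = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.cls_conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.reg_conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls_pred = nn.Conv2d(in_channels, num_anchors * num_classes, 1)
        self.reg_pred = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box coordinates
        self.obj_pred = nn.Conv2d(in_channels, num_anchors * 1, 1)  # box confidence

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls_feat = self.cls_conv(x)
        reg_feat = self.reg_conv(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)

# Hypothetical usage on a P3-sized feature map (stride 8 of a 640 x 640 input):
# p3 = torch.randn(1, 256, 80, 80)
# cls_map, box_map, obj_map = DecoupledHeadSketch(256, num_classes=10)(p3)
```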