5.1. YOLOv5 Model’s Specific Modifications and Performance Testing
The YOLO series [18,19,20,21,22] is a family of one-stage object detection algorithms whose framework comprises four key components: the input section, the backbone section, the neck section, and the prediction section. YOLOv5 distinguishes itself from other models in the series by its streamlined architecture, which enables higher detection speed and greater flexibility in deployment. A notable feature of YOLOv5 is mosaic data augmentation, which increases dataset diversity by stitching four randomly selected images into a single training image. The model extracts features with a CSPDarknet53 backbone and an SPP layer, and the extracted features are subsequently fused by PANet [23]. YOLOv5 is available in five variants: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The variant examined in this study is an improved version of YOLOv5s, tailored primarily to object detection in complex deep-sea environments.
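As an illustration of how mosaic augmentation combines four images, the following Python sketch stitches four source images around a randomly chosen center point. The 640 × 640 output size, the grey fill value, and the function name are assumptions made for this example; YOLOv5's own implementation additionally remaps the bounding-box labels of each source image.

```python
import random
import numpy as np


def mosaic4(images, out_size=640):
    """Stitch four (H, W, 3) uint8 images into one mosaic around a random center."""
    xc = random.randint(out_size // 4, 3 * out_size // 4)  # mosaic center, x
    yc = random.randint(out_size // 4, 3 * out_size // 4)  # mosaic center, y
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey background

    # Quadrants: top-left, top-right, bottom-left, bottom-right of the center point.
    regions = [(0, 0, xc, yc), (xc, 0, out_size, yc),
               (0, yc, xc, out_size), (xc, yc, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        patch = img[: y2 - y1, : x2 - x1]  # crop the source to the region size
        canvas[y1:y1 + patch.shape[0], x1:x1 + patch.shape[1]] = patch
    return canvas
```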
Figure 4 illustrates the structural design of the improved YOLOv5 model.
Experiments showed that the YOLOv5s model still exhibits low accuracy when applied to the identification of manganese nodules in our dataset. The model was therefore modified in the following two aspects:
(1) The C3 module is improved to the C2f module.
The C2f (cross-stage partial with two convolutions) module [24] is an enhancement of the C3 module. Certain aspects of its design resemble the ELAN architecture, which parallelizes multiple gradient-flow branches, allowing the model to obtain richer gradient-flow information while remaining lightweight; this improves both the training efficacy and the convergence speed of the model. To further reduce computational demands and improve efficiency, the number of Bottleneck input channels in the C2f module is reduced to 50% of that of the preceding level, and the kernel size of the initial convolutional layer is changed from 6 × 6 to 3 × 3. In addition, a split operation is introduced between the CBS and Bottleneck components to partition the feature map.
The structure of the C2f module is illustrated in Figure 5. Within this module, the Bottleneck component consists of two 1 × 1 convolutional layers and one 3 × 3 convolutional layer. The two 1 × 1 convolutional layers adjust the input and output dimensions, making the 3 × 3 convolutional layer the bottleneck in terms of input/output channels. This Bottleneck architecture reduces the number of parameters and improves computational efficiency while maintaining the original accuracy. A series of experiments with different placements of the C2f module and different numbers of replacements showed that the best value was obtained by substituting the first C3 module of the Darknet53 backbone with a C2f module.
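For reference, the following PyTorch sketch shows one possible implementation of a C2f-style block consistent with the description above: a CBS layer followed by a split of the feature map, a chain of Bottlenecks (each built from two 1 × 1 convolutions and one 3 × 3 convolution) that adds parallel gradient-flow branches, and a concatenation of all branches. Class names, the channel-reduction ratio, and the default branch count are illustrative assumptions rather than the exact configuration used in our network.

```python
import torch
import torch.nn as nn


class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic YOLOv5 convolution unit."""

    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """Two 1x1 convolutions around one 3x3 convolution, with a residual connection."""

    def __init__(self, c, e=0.5, shortcut=True):
        super().__init__()
        c_mid = int(c * e)               # the 3x3 layer is the channel bottleneck
        self.cv1 = CBS(c, c_mid, 1)      # 1x1: reduce channels
        self.cv2 = CBS(c_mid, c_mid, 3)  # 3x3: spatial processing
        self.cv3 = CBS(c_mid, c, 1)      # 1x1: restore channels
        self.add = shortcut

    def forward(self, x):
        y = self.cv3(self.cv2(self.cv1(x)))
        return x + y if self.add else y


class C2f(nn.Module):
    """Split the feature map, run n Bottlenecks on one half, concatenate all branches."""

    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2                       # branch width: 50% of the output channels
        self.cv1 = CBS(c_in, 2 * self.c, 1)       # CBS before the split
        self.cv2 = CBS((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))     # split into two halves
        for m in self.m:
            y.append(m(y[-1]))                    # each Bottleneck adds a gradient-flow branch
        return self.cv2(torch.cat(y, dim=1))      # fuse all branches


# Example: replace the first backbone C3 stage with a block such as C2f(64, 128, n=1).
```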
(2) NWD + IoU double loss function
To address the challenge of balancing the loss computation for targets of varying scales, we propose the NWD + IoU double loss function. The IoU (intersection over union) loss is the predominant loss function in the YOLO series of algorithms; however, it has significant limitations in detecting small targets and in multi-scale fusion detection. The NWD (normalized Wasserstein distance), by contrast, is less sensitive to variations in object scale, making it more effective for assessing the similarity of small objects. To improve recognition accuracy on the manganese nodule dataset, we therefore augment the loss function into the NWD + IoU double loss, which enables a more comprehensive calculation of the total loss associated with the target.
In contrast to the IoU loss function, the NWD loss function [25] is better suited to measuring the similarity of small objects modeled as Gaussian distributions. The fundamental principle of this approach is to represent both the predicted and the ground-truth bounding boxes as two-dimensional Gaussian distributions and to measure their similarity through a normalized Wasserstein distance between the two distributions; unlike IoU, this measure remains informative even when a small predicted box and the ground-truth box barely overlap.
Given that most objects are not strictly rectangular, delineating the boundary of a small target from the background is frequently ambiguous. To make the loss function more precise, the two-dimensional Gaussian distribution assigns the maximum weight to the central point of the small target, with the weight decreasing progressively toward the periphery. Consider a bounding box $R = (c_x, c_y, w, h)$, where $c_x$ and $c_y$ are the coordinates of the center point, and $w$ and $h$ denote the width and height of the bounding box. Its inscribed ellipse can be expressed as Equation (1):

$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1 \qquad (1)$$

where $(\mu_x, \mu_y)$ is the center coordinate of the ellipse and $\sigma_x$, $\sigma_y$ are the semi-axis lengths along the $x$ and $y$ axes, with $\mu_x = c_x$, $\mu_y = c_y$, $\sigma_x = w/2$, $\sigma_y = h/2$.
When $(\mathbf{x} - \boldsymbol{\mu})^{\mathrm T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = 1$ is satisfied, the ellipse in Equation (1) is a density contour of a two-dimensional Gaussian distribution, whose probability density function can be expressed as Equation (2):

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm T} \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)}{2\pi \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \qquad (2)$$

where the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$ of the bounding box $R = (c_x, c_y, w, h)$ are given by Equations (3) and (4), respectively:

$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix} \qquad (3)$$

$$\boldsymbol{\Sigma} = \begin{bmatrix} \dfrac{w^2}{4} & 0 \\ 0 & \dfrac{h^2}{4} \end{bmatrix} \qquad (4)$$
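As a concrete reading of Equations (3) and (4), the short numpy sketch below converts a bounding box $(c_x, c_y, w, h)$ into the mean vector and covariance matrix of its Gaussian model; the function name and example values are assumptions for illustration.

```python
import numpy as np


def box_to_gaussian(cx, cy, w, h):
    """Model a bounding box (cx, cy, w, h) as a 2D Gaussian N(mu, sigma)."""
    mu = np.array([cx, cy], dtype=float)           # Equation (3): mean = box center
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])  # Equation (4): diagonal covariance
    return mu, sigma


# Example: a 40 x 20 box centred at (100, 60).
mu, sigma = box_to_gaussian(100, 60, 40, 20)
```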
NWD measures the distance between the two distributions using the Wasserstein distance. For two two-dimensional Gaussian distributions $\mathcal{N}_1(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}_2(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$, the second-order Wasserstein distance between $\mathcal{N}_1$ and $\mathcal{N}_2$ can be simplified as Equation (5):

$$W_2^2(\mathcal{N}_1, \mathcal{N}_2) = \left\lVert \left[c_{x_1},\, c_{y_1},\, \tfrac{w_1}{2},\, \tfrac{h_1}{2}\right]^{\mathrm T} - \left[c_{x_2},\, c_{y_2},\, \tfrac{w_2}{2},\, \tfrac{h_2}{2}\right]^{\mathrm T} \right\rVert_2^2 \qquad (5)$$
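The sketch below evaluates the simplified second-order Wasserstein distance of Equation (5) and the exponentially normalized NWD similarity proposed in [25]. The normalization constant C is dataset-dependent; the value used here is only a placeholder.

```python
import math


def wasserstein2_squared(box_a, box_b):
    """Squared second-order Wasserstein distance of Equation (5) for boxes (cx, cy, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return (ax - bx) ** 2 + (ay - by) ** 2 + ((aw - bw) / 2) ** 2 + ((ah - bh) / 2) ** 2


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance similarity in (0, 1]; c is a dataset-dependent constant."""
    return math.exp(-math.sqrt(wasserstein2_squared(box_a, box_b)) / c)
```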
The detailed procedure for fusing NWD and IoU is as follows: first, the Wasserstein distance is computed between the predicted bounding box and the ground-truth bounding box and is condensed into a one-dimensional tensor. Next, a scaling factor is introduced to balance the Wasserstein loss against the IoU loss; experiments indicate that the best results are obtained when the scaling factor is set to 0.5. Finally, the overall loss value is computed. After the NWD loss function is integrated into the framework, the overall loss function can be expressed as Equation (6):
$$L_{\mathrm{total}} = r\,(1 - \mathrm{NWD}) + (1 - r)\,(1 - \mathrm{IoU}) \qquad (6)$$

where $L_{\mathrm{total}}$ is the total loss, $r$ is the ratio (scaling factor) between the two loss terms, and $\mathrm{IoU}$ is the ratio of the intersection to the union of the ground-truth box and the predicted box.
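Putting the pieces together, the following sketch shows one plausible form of the fused regression loss of Equation (6), reusing the nwd() helper from the previous sketch. The single scaling factor r and the simple iou() helper for axis-aligned boxes are illustrative assumptions; the full training loss of YOLOv5 also includes objectness and classification terms.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def box_regression_loss(pred, target, r=0.5):
    """Weighted NWD + IoU regression loss for one predicted/ground-truth box pair."""
    return r * (1.0 - nwd(pred, target)) + (1.0 - r) * (1.0 - iou(pred, target))
```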
The recognition rate of the enhanced YOLOv5 network model has shown improvement.
Figure 6 presents eight recognition cases covering a diverse range of object categories, sizes, and scene complexities, providing a comprehensive assessment of the model's generalization capability and robustness. The cases illustrate that the enhanced model achieves accurate detection and effective localization of various targets. Notably, in complex environments with variations in illumination, occlusions, targets of differing scales, or background interference, the improved YOLOv5 can accurately identify the primary targets in the image and deliver precise bounding boxes. In these cases, yellow labels represent actual seabed data, while purple labels denote the self-constructed dataset.
To further assess the model's performance, images of manganese nodules with overlapping and partially masked regions were selected. Figure 7 displays the prediction results of the enhanced algorithm. The figure shows that targets are accurately recognized when the overlapping region is small, whereas accurate recognition becomes difficult when the overlap is large and the boundary is more blurred. Masked targets can still be recognized, but the position of the prediction box may deviate to some extent when the masked area is too large.