Article

Grabbing Path Extraction of Deep-Sea Manganese Nodules Based on Improved YOLOv5

Department of Mechanical and Electrical Engineering, Ocean University of China, Qingdao 266100, China
*
Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(8), 1433; https://doi.org/10.3390/jmse12081433
Submission received: 14 June 2024 / Revised: 13 August 2024 / Accepted: 17 August 2024 / Published: 19 August 2024

Abstract

In an effort to enhance the efficiency and accuracy of deep-sea manganese nodule grasping behavior by a manipulator, a novel approach employing an improved YOLOv5 algorithm is proposed for the extraction of the shortest paths to manganese nodules targeted by the manipulator. The loss function of YOLOv5s has been improved by integrating a dual loss function that combines IoU and NWD, resulting in better accuracy for loss calculations across different target sizes. Additionally, substituting the initial C3 module in the network backbone with a C2f module is intended to improve the flow of gradient information while reducing computational demands. Once the geometric center of the manganese nodules is identified with the improved YOLOv5 algorithm, the next step involves planning the most efficient route for the manipulator to pick up the nodules using an upgraded elite strategy ant colony algorithm. Enhancements to the ACO algorithm consist of implementing an elite strategy and progressively decreasing the number of ants in each round. This method reduces both the number of iterations and the time required for each iteration, while also preventing the occurrence of local optimal solutions. The experimental findings indicate that the improved YOLOv5s detection algorithm boosts detection accuracy by 2.3%. Furthermore, when there are fewer than 30 target planning points, the improved algorithm requires, on average, 24% fewer iterations than the ACO algorithm to determine the shortest path. Additionally, the speed of calculation for each iteration is quicker while still providing the optimal solution.

1. Introduction

The ocean is rich in mineral resources, one of which is manganese nodules. These nodules are a form of ore that forms on the seabed, characterized by a black or brown surface and a spherical or bulky shape. They contain over 30 different metal elements, with manganese, copper, cobalt, and nickel being the most commercially valuable for extraction [1]. The extraction of deep-sea manganese nodules is primarily reliant on the utilization of deep-sea manipulators. Currently, the autonomous operation systems of vision-based manipulators exhibit challenges related to accuracy and efficiency. A significant focus in addressing these challenges is the improvement in both the detection success rate and the efficiency of grasping manganese nodules. Notably, advancements have been made by researchers in this domain. For instance, Bazeille et al. employed color features for the detection of underwater targets, with experimental findings suggesting that suboptimal lighting conditions—whether insufficient or excessive—negatively impact detection efficacy [2]. Christian et al. implemented an underwater target detection technique utilizing active contour methods, which delineates the target region by leveraging three distinct features: color, direction, and brightness. The integration of statistical approaches alongside active contour methodologies significantly mitigates the dependence on initial parameters for target detection. Nonetheless, it is important to note that the efficacy of detection may be compromised when the background color closely resembles that of the target [3].
Given the intricate and variable lighting conditions and scenes present on the seafloor, deep learning-based recognition methods demonstrate significant advantages. Currently, deep learning detection algorithms are categorized into two types: one-stage target detection algorithms and two-stage target detection algorithms. The two-stage target detection algorithms, in contrast to their one-stage counterparts, provide benefits such as enhanced classification accuracy and improved precision in detection box localization. A seminal contribution to the field was made by Girshick et al. with the introduction of the R-CNN algorithm, which marked the first application of convolutional neural networks for target detection [4]. Building upon this, Girshick et al. and Ren et al. introduced Fast R-CNN and Faster R-CNN algorithms, establishing a series of R-CNN algorithms [5,6]. The two-stage target detection algorithm is currently in the nascent phase of development within the field of underwater target detection, with numerous researchers actively working to refine the R-CNN series of algorithms. Yuan et al. introduced a secondary transfer learning approach that builds upon the Faster R-CNN framework, incorporating enhancements through the multi-scale retinex with color restoration (MSRCR) technique. Initially, the Open Images fish dataset was processed utilizing the ImageNet model to establish a preliminary network. Subsequently, an underwater fish dataset was collected, followed by a second transfer learning phase to produce the final network. Ultimately, the MSRCR algorithm was employed to improve the congruence between the underwater dataset and the Open Images fish dataset, effectively mitigating the challenges associated with the degradation of image quality in underwater environments [7]. Yang et al. proposed the QueryDet target detection algorithm, which enhances both the inference speed and the accuracy of the model by introducing the CSQ mechanism. 
This offers a new direction for future research [8].
The one-stage target detection algorithm predicts target regions directly, without generating candidate regions during the detection process. Compared to the two-stage target detection method, its detection speed is faster, and it offers a favorable balance between detection speed and accuracy. Therefore, it has more engineering application value. The main representative of the one-stage target detection algorithm is the YOLO series algorithm. Wang et al. proposed an improved YOLOv2 algorithm and applied it to detect fish images. It changed the last convolutional layer of the network model from 7 × 7 to 9 × 9. Furthermore, in accordance with the specific requirements, the quantity of filters in the final convolutional layer was decreased. This modification markedly improved the detection accuracy of YOLOv2 for small fish [9]. Muksit et al. introduced an advanced version of the YOLOv3 model, designated as YOLO-Fish-1, which significantly enhanced the recognition accuracy of small target fish by optimizing the detection step size. Following this, they developed YOLO-Fish-2, which added spatial pyramid pooling to YOLO-Fish-1. This modification not only bolstered the robustness of the network model but also augmented its recognition accuracy in dynamic environments [10]. Chen et al. introduced an enhanced version of the YOLOv4 algorithm, which optimizes the feature extraction capabilities of CSPDarknet53 (cross-stage partial network Darknet53), the foundational architecture of the YOLOv4 network. This enhancement enables the algorithm to acquire weighted multi-scale features, subsequently utilizing the most pertinent features for the detection of underwater targets. Verification using the Brackish dataset demonstrated an improvement in average accuracy by 5.03% while achieving a detection speed of 15 frames per second [11]. Xu et al. introduced the PP-YOLOE target detection algorithm, which incorporates several enhancements to both the training strategy and inference speed. These modifications result in a significant increase in inference speed and training efficiency while preserving high accuracy. Consequently, the model demonstrates enhanced competitiveness in practical applications [12]. Wang et al. introduced the YOLOv7 algorithm, which integrates an ELAN module within its network architecture. This integration significantly enhances the model’s ability to capture long-range dependencies within images. Additionally, the convolutional approach employed, known as RepConvN, serves to decrease both the complexity of the model and its computational demands, all while maintaining a high level of accuracy [13]. Cai et al. introduced the Gold-YOLO algorithm, which incorporates a gather-and-distribute mechanism alongside a pre-training approach inspired by the masked autoencoder (MAE) framework. This methodology significantly enhances the model’s ability to fuse multi-scale information, resulting in state-of-the-art (SOTA) performance across various datasets [14]. Wang et al. introduced the YOLOv9 algorithm, which has been optimized comprehensively in terms of network architecture and loss function. This optimization markedly enhances the model’s detection performance in complex environments, positioning it among the most advanced target detection algorithms currently available [15].
Concurrently, advancements in research utilizing the Transformer model have been notable. Huang et al. introduced AD-DETR, a two-stage target detection algorithm that markedly enhances the accuracy of target detection while simultaneously reducing the model’s inference time. This improvement is achieved through the incorporation of multi-scale feature integration, alongside enhancements to the training objectives and matching strategies [16]. Liu et al. introduced DAB-DETR, an enhanced version of the DETR algorithm. This approach integrates self-attention and cross-attention mechanisms within a two-layer framework, thereby significantly enhancing the precision of target feature extraction in intricate environments [17].
In comparison to one-stage algorithms, two-stage target detection algorithms exhibit superior performance in terms of accuracy and localization. However, the computational speed and cost constraints associated with two-stage algorithms render them unsuitable for scenarios that require real-time processing. The evolution of target detection algorithms has resulted in incremental enhancements in the detection accuracy of one-stage algorithms; however, this progress has concurrently led to a substantial increase in the number of parameters and computational complexity, thereby diminishing their applicability for real-time detection and processing tasks. The challenge that arises is how to optimize the model for real-time detection scenarios by minimizing its weight without sacrificing accuracy. The YOLOv5 algorithm represents a lightweight model that maintains high accuracy while providing rapid inference speeds, making it particularly well-suited for real-time scene processing. In the context of current underwater operations, the identification and manipulation tasks performed by manipulators predominantly rely on manual identification and positioning, which lack autonomy and are characterized by inefficiency and time consumption. To mitigate these challenges, this study proposes a method for the autonomous recognition and grasping of manganese nodules. This method incorporates a lightweight algorithm with high inference speed and enhances it to further boost accuracy and grasping efficiency. Furthermore, the elite strategy ant colony algorithm is employed to determine the shortest grasping path from a single-frame image, thereby reducing the time required for grasping. This approach offers a conceptual framework for the autonomous operation of underwater manipulators.

2. Materials and Methods

Stereolabs ZED

The apparatus employed in this study is the ZED 2 binocular stereo camera manufactured by Stereolabs. This camera features a resolution of 3840 × 1080 pixels (1080P) and operates at a refresh rate of 30 Hz. The sealing treatment applied to the exterior of the camera is illustrated in Figure 1, where (1) denotes the Stereolabs ZED camera and (2) indicates the sealing structure.
In the course of the experiment, the ZED binocular camera and the manipulator are configured in an eye-to-hand placement mode, as illustrated in Figure 1. This configuration entails that both the camera and the manipulator are stationary, thereby maintaining a consistent coordinate transformation relationship between the camera and the base of the manipulator. Communication between the ZED binocular camera and the manipulator is facilitated via a handheld controller. The comprehensive communication process of the entire system is depicted in Figure 2.

3. Manipulator Grasping Method Based on Binocular Vision

Feature extraction for underwater targets typically involves the utilization of color, texture, and shape features derived from enhanced underwater imagery. These extracted features are subsequently classified using a designated classifier. The practical application outcomes of these methodologies are illustrated in Figure 3. However, the aforementioned methods encounter several challenges in the feature extraction process. Notably, the histogram of the extracted color features fails to adequately represent the spatial distribution information of the image. In instances of color similarity, the degree of waveform coincidence is elevated, leading to comparable two-dimensional color histograms, which complicates classifier recognition. Furthermore, relying solely on shape features does not effectively capture the spatial information of the image, and variations such as target rotation and occlusion can adversely affect the recognition rate. While texture feature extraction demonstrates greater robustness, the irregularities in underwater illumination may produce misleading textures of the target. Additionally, the target may exhibit diverse texture features at varying resolutions, which can further influence the recognition rate. Traditional methods also face significant difficulties in the identification of manganese nodules.
Because traditional methods are inadequate for the effective identification of manganese nodules, this study adopts a deep learning approach to their recognition. Specifically, the YOLOv5 model has been optimized to enhance the detection rate of manganese nodules, so that the geometric centers of more nodules can be identified. Furthermore, to improve the efficiency of the manipulator and minimize energy consumption, an enhanced elite strategy ant colony algorithm is applied once the geometric centers have been recognized. This algorithm determines the path with the highest pheromone concentration, which is extracted as the shortest route for the manipulator to grasp the manganese nodules.

4. Improved YOLOv5 for Manganese Nodule Detection

4.1. Data Acquisition and Selection

4.1.1. Data Acquisition

The dataset was compiled in the underwater robotics laboratory at the Ocean University of China on 18 October 2023. The images were captured using a ZED 2 binocular stereo camera (Stereolabs, San Francisco, USA) with a resolution of 672 × 376 pixels. To ensure a diverse range of samples, between 5 and 60 nodules were randomly positioned in the pool, with portions of them submerged in sediment. Various levels of brightness and angles of illumination were employed during the sampling process to further enhance sample diversity. Following a selection process, 1200 images were retained, and the dataset was subsequently augmented to a total of 2500 images through techniques such as mosaic creation, rotation, and the introduction of noise. The dataset was partitioned into a training set and a validation set in a 7:3 ratio. Image labeling was conducted using LabelImg, with the type and coordinate information of the labeled results stored in *.txt format.
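The 7:3 partition described above can be sketched as follows. This is a minimal illustration, not the authors' script: the file names, the fixed seed, and the function name `split_dataset` are hypothetical, and the text does not state how the split was randomized.

```python
import random

def split_dataset(image_names, train_ratio=0.7, seed=0):
    """Shuffle image names reproducibly and split them into train/validation lists."""
    names = sorted(image_names)      # sort first so the result depends only on the seed
    rng = random.Random(seed)        # local generator; the seed value is an assumption
    rng.shuffle(names)
    n_train = round(len(names) * train_ratio)
    return names[:n_train], names[n_train:]

# Hypothetical file names standing in for the 2500 augmented images.
images = [f"nodule_{i:04d}.jpg" for i in range(2500)]
train, val = split_dataset(images)
print(len(train), len(val))
```

Fixing the seed keeps the validation set stable across training runs, which matters when comparing model variants on the same data.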

4.1.2. Data Selection

Currently, the availability of open-source images depicting manganese nodules from actual seabed environments is quite limited. Consequently, this study utilizes datasets derived from simulated laboratory environments. To effectively replicate the recognition context of the marine environment, this research integrates various datasets and identifies the most appropriate combinations through a series of comparative experiments. The experimental groups are defined as follows:
(1) Group A consists of both the training set and the validation set sourced from laboratory images;
(2) Group B comprises a training set derived from laboratory images, while the validation set is obtained from publicly accessible images;
(3) Group C involves a combination of laboratory images and open-source images at a 1:1 ratio, with both the training and validation sets being formed from this integrated dataset.
In this paper, the three datasets were trained using the same YOLOv5 framework, and the experimental data are presented in Table 1.
The data show that the AP value of comparison group B is relatively low. This indicates that a model trained only on laboratory images has difficulty identifying real seabed images taken from open-source collections. To further verify this conclusion, the weights trained on group A were applied to real seabed images, and they proved ineffective at identifying them. The seabed environment is complex, with changing light and scenery. Therefore, we adopted comparison group C as the dataset: adding real seabed images to the training process supplements image features that the laboratory images lack, so that manganese nodules in real-world environments can be identified more reliably.

4.2. Training Environment

The training system runs on Windows 11 and is equipped with a 12th Gen Intel® Core™ i5-12400 2.50 GHz CPU, 32 GB of memory, and two NVIDIA GeForce RTX 3060 graphics cards. The training framework is PyTorch, and Anaconda is used to manage the environment configuration for related software.
The proposed algorithm is implemented in Python 3.9 on top of PyTorch 2.0.

5. Improved YOLOv5

5.1. YOLOv5 Model’s Specific Modifications and Performance Testing

The YOLO series [18,19,20,21,22] is a one-stage object detection algorithm characterized by a framework comprising four key components: the input section, the trunk section, the neck section, and the prediction section. Notably, YOLOv5 distinguishes itself from other models within the YOLO series by its streamlined architecture, which facilitates enhanced detection speed and greater flexibility in deployment. A significant feature of YOLOv5 is its implementation of mosaic data augmentation, which enhances the dataset’s diversity by combining four randomly selected images. The model employs a feature extraction structure that includes CSPDarknet53 and an SPP layer, with the extracted features subsequently integrated using PANet [23]. YOLOv5 is available in five distinct variants: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The variant examined in this study is an improved version of YOLOv5s, primarily tailored for object detection in intricate deep-sea environments. Figure 4 illustrates the structural design of the improved YOLOv5 model.
Through experiments, it has been found that the YOLOv5s model still exhibits low accuracy when applied to the identification of manganese nodule datasets. Therefore, the model has been modified in the following two aspects:
(1) The C3 module is improved to the C2f module.
The architecture of the C2f (cross-stage partial network) module [24] has been enhanced based on the C3 module. Certain aspects of its design bear a resemblance to the ELAN architecture, which facilitates the parallelization of multiple gradient flow branches, thereby enabling the acquisition of more comprehensive gradient flow information while maintaining a lightweight model. This approach aims to enhance both the training efficacy and convergence speed of the model. To further minimize computational demands and improve efficiency, the number of Bottleneck input channels in the C2f module has been reduced to 50% of that in the preceding level, and the kernel size of the initial convolutional layer has been modified from 6 × 6 to 3 × 3. Additionally, a segmentation has been introduced between the CBS and Bottleneck components to partition the feature map.
The structure of the C2f module is illustrated in Figure 5. Within this module, the Bottleneck component consists of two 1 × 1 convolutional layers and one 3 × 3 convolutional layer. The primary function of the two 1 × 1 convolutional layers is to modify the dimensions of the input and output layers, thereby establishing the 3 × 3 convolutional layer as the bottleneck in terms of input/output dimensions. This Bottleneck architecture effectively decreases the number of parameters and enhances computational efficiency while maintaining the original level of accuracy. Through a series of experiments involving various placements of the C2f module and differing quantities of replacements, it was determined that the optimal AP value was attained by substituting the initial C3 module of the Darknet53 network backbone with a C2f module.
(2) NWD + IoU double loss function
To address the challenge of balancing the computation of the loss function for targets of varying scales, we propose the NWD + IoU double loss function. The IoU (intersection over union) loss function is the predominant loss function used within the YOLO series of algorithms. Nevertheless, it exhibits significant limitations in the detection of small targets and in multi-scale fusion detection. Conversely, the NWD (normalized Wasserstein distance) is less sensitive to variations in object scale, rendering it more effective for assessing the similarity of small objects. To improve the recognition accuracy on the manganese nodule dataset, we have augmented the loss function to incorporate the NWD + IoU double loss function, thereby facilitating a more comprehensive calculation of the total loss associated with the target.
In contrast to the IoU loss function, the NWD loss function [25] is more adept at assessing the similarity of small objects. Its fundamental principle is to model each bounding box as a two-dimensional Gaussian distribution and to measure the similarity between the predicted box and the ground-truth box by the Wasserstein distance between the two distributions. Unlike the KL (Kullback–Leibler) divergence, this distance remains informative even when the two boxes barely overlap or do not overlap at all.
Given that the majority of objects do not conform to standard rectangular shapes, the process of delineating the small target boundary from the background frequently results in ambiguity. To enhance the precision of the loss function, the NWD loss function incorporates a two-dimensional Gaussian distribution, which allocates the maximum weight to the central point of the small target, with the weight progressively diminishing towards the periphery. Consider a bounding box $S = (a_x, a_y, w, h)$, where $a_x$ and $a_y$ are the coordinates of the center point, and $w$ and $h$ denote the width and height of the bounding box. The equation of its inscribed ellipse can be expressed as Equation (1):
$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1 \tag{1}$$
where $(\mu_x, \mu_y)$ is the center coordinate of the ellipse and $\sigma_x$, $\sigma_y$ are the half-axis lengths along the $x$ and $y$ axes, with $\mu_x = a_x$, $\mu_y = a_y$, $\sigma_x = \frac{w}{2}$, $\sigma_y = \frac{h}{2}$.
When $(\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = 1$ is satisfied, this ellipse is a density contour of a two-dimensional Gaussian distribution, whose probability density function can be expressed as Equation (2):
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{2\pi \left|\boldsymbol{\Sigma}\right|^{\frac{1}{2}}} \tag{2}$$
where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given by Equations (3) and (4), respectively, with $(c_x, c_y) = (a_x, a_y)$ the center of the box:
$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix} \tag{3}$$
$$\boldsymbol{\Sigma} = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix} \tag{4}$$
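NWD calculates the distribution distance using the Wasserstein distance. For two two-dimensional Gaussian distributions $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$, the second-order Wasserstein distance between $\mu_1$ and $\mu_2$ can be simplified, and its normalized exponential form gives the NWD between the Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ of two bounding boxes (5):
$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right) \tag{5}$$
where $C$ is a normalization constant.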
The detailed procedure for the fusion of NWD and IoU is outlined as follows: first, the Wasserstein distance is computed between the predicted bounding box and the actual bounding box, which is then condensed into a one-dimensional tensor. Next, a scaling factor is established to balance the Wasserstein loss against the IoU loss, with empirical evidence indicating that the AP achieves its maximum value when the scaling factor is set to 0.5. Following this, the overall loss value is determined. With the NWD loss function integrated into the framework, the overall loss function can be represented as Equation (6):
$$L_{total} = \left(1 - NWD(\mathcal{N}_a, \mathcal{N}_b)\right) \cdot NWD_{ratio} + \left(1 - NWD_{ratio}\right)\left(1 - IoU\right) \tag{6}$$
where $L_{total}$ is the total loss, $NWD_{ratio}$ is the weight assigned to the NWD loss term, and $IoU$ is the ratio of the intersection to the union of the real box and the prediction box.
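The weighted combination above can be illustrated for a single pair of boxes. The following is a minimal sketch under stated assumptions rather than the training code used in this study: boxes are given as (center x, center y, width, height); the closed form used for the squared Wasserstein distance between the two box Gaussians follows the NWD formulation of [25] and is not written out in the text; and the constant `c=12.8` is a placeholder value, not one reported here.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between the Gaussians of two boxes.

    Assumed closed form: W_2^2 is the squared Euclidean distance between
    (cx, cy, w/2, h/2) of the two boxes. `c` is a placeholder constant.
    """
    w2_sq = ((box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
             + ((box_a[2] - box_b[2]) / 2) ** 2
             + ((box_a[3] - box_b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def combined_loss(pred, target, nwd_ratio=0.5, c=12.8):
    """Weighted sum of the NWD loss and the IoU loss for one box pair."""
    return (nwd_ratio * (1.0 - nwd(pred, target, c))
            + (1.0 - nwd_ratio) * (1.0 - iou(pred, target)))

# Two small boxes that do not overlap: IoU is zero, but NWD still
# distinguishes a near miss from a far miss.
target = (10.0, 10.0, 4.0, 4.0)
near, far = (15.0, 10.0, 4.0, 4.0), (40.0, 10.0, 4.0, 4.0)
print(combined_loss(near, target), combined_loss(far, target))
```

The example shows why the NWD term helps for small targets: once two boxes stop overlapping, the IoU term is constant, while the NWD term still decreases as the prediction moves closer to the target.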
The recognition rate of the enhanced YOLOv5 network model has shown improvement. Figure 6 presents eight recognition cases that encompass a diverse range of object categories, sizes, and scene complexities, thereby providing a comprehensive assessment of the model’s generalization capabilities and robustness. The cases illustrate that the enhanced model demonstrates accuracy and effective localization performance in detecting various targets. Notably, in complex environments characterized by variations in illumination, occlusions, targets of differing scales, or background interference, the improved YOLOv5 is capable of accurately identifying the primary targets within the image and delivering precise bounding boxes. In this context, yellow labels represent actual seabed data, while purple labels denote a self-constructed dataset.
To enhance the assessment of the model’s performance, images of manganese nodules exhibiting overlapping regions and masking were identified. The following Figure 7 displays the results predicted by the enhanced algorithm. The figure illustrates that the target can be accurately recognized when the overlapping part is small. However, accurate recognition becomes challenging when the overlapping part is large and the boundary is more blurred. The masked target can still be recognized, but the position of the prediction box may deviate to some extent when the mask is too large.

5.2. Analysis of the Computational Complexity and Resource Requirements of the Improved Model

On the dataset used in this study, the initial YOLOv5 model has a floating-point operation count of 21 GFLOPs and 9 million parameters. The improved model increases the floating-point operation count by 7 GFLOPs and the parameter count by 4.6 million. Additionally, in the feature layer where C2f is located, the parameter count has increased from 18,000 to 38,000. This indicates that the improved model has higher computational complexity, but it still maintains a significant lightweight advantage over models such as YOLOv8. While the YOLOv5 model is relatively lightweight in terms of computational resources, deep learning-based object detection algorithms still require GPU support, so real-time detection with the YOLOv5 model demands high hardware specifications. Moreover, differences in lighting conditions and the presence of more complex background interference in actual scenarios may pose limitations during deployment.

6. The Shortest Grasping Path Extraction of the Manipulator

In the context of real-time video transmission, the camera captures multiple objects within each frame. Strategic planning of the sequence in which these targets are grasped can significantly reduce the time required for grasping and enhance the overall efficiency of the operation. The procedure for frame-by-frame grasping path planning is delineated as follows:
Step 1: Identify the center point of the target object within each frame and record the corresponding coordinates.
Step 2: Employ an improved elite strategy ant colony algorithm to devise a path based on the identified center point coordinates.
Step 3: Integrate the planned path with the actual image to facilitate the manipulation of the manipulator’s rotation.
The specific implementation steps of the visual part are illustrated in Figure 8.
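Step 1 of the procedure above can be sketched as follows. This is an illustration only: the detection tuple layout `(x1, y1, x2, y2, confidence)` and the confidence threshold are assumptions, since the text does not specify the detector's output format.

```python
def box_centers(detections, conf_threshold=0.5):
    """Return the geometric center of every sufficiently confident detection.

    Each detection is assumed to be (x1, y1, x2, y2, confidence) in pixels,
    with (x1, y1) the top-left and (x2, y2) the bottom-right corner.
    """
    centers = []
    for x1, y1, x2, y2, conf in detections:
        if conf >= conf_threshold:
            centers.append(((x1 + x2) / 2.0, (y1 + y2) / 2.0))
    return centers

# Hypothetical detections for one frame; the second is filtered out.
frame_detections = [(0, 0, 10, 20, 0.9), (5, 5, 7, 7, 0.2), (30, 40, 50, 60, 0.8)]
centers = box_centers(frame_detections)
print(centers)
```

The resulting list of center coordinates is the input that the path-planning step operates on.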

6.1. Basic Ant Colony Optimization Algorithm

To address the challenge of path extraction among various targets within a single image frame, we propose the implementation of the ACO algorithm [26,27,28]. ACO is a heuristic optimization technique designed to tackle pathfinding issues. In their natural foraging behavior, ants typically select paths at random. Upon traversing a particular route, ants deposit a chemical substance known as pheromones, which facilitates communication among the ant population. Other ants, upon encountering these pheromones, can detect their concentration through their antennae and are inclined to favor paths with higher pheromone levels. Given that shorter paths require less time for traversal, the frequency of ant passage along these routes increases, resulting in a greater accumulation of pheromones. This positive feedback loop encourages an increasing number of ants to follow the paths with elevated pheromone concentrations, ultimately leading to the identification of the optimal route. The probability of an ant moving from point a to point b at time t is as follows:
$$P_{ab}^{k}(t) = \begin{cases} \dfrac{\left[\tau_{ab}(t)\right]^{\alpha} \cdot \left[\eta_{ab}(t)\right]^{\beta}}{\sum_{c \in S(k)} \left[\tau_{ac}(t)\right]^{\alpha} \cdot \left[\eta_{ac}(t)\right]^{\beta}}, & b \in S(k) \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
where $S(k)$ $(k = 1, 2, \ldots, m)$ is the set of locations that ant $k$ still needs to traverse, $m$ is the number of ants, $\eta_{ab}$ is the heuristic information, whose value is the reciprocal of the distance between point $a$ and point $b$, $\alpha$ and $\beta$ are the degrees of influence of the pheromone and the heuristic information on ant path selection, and $n$ is the total number of points. $\tau_{ab}(t)$ is the amount of pheromone between points $a$ and $b$ at time $t$. When the ant has traversed all the points, the pheromone corresponding to each path is updated as follows (8):
$$\tau_{ab}(t + 1) = (1 - \delta)\,\tau_{ab}(t) + \Delta\tau_{ab} \tag{8}$$
where $\Delta\tau_{ab}$ is the pheromone left between $a$ and $b$ in the current cycle, and $\delta \in (0, 1)$ is the degree of attenuation of the pheromone.
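The transition rule and the pheromone update can be sketched in a few lines. This is a minimal illustration, not the implementation used in this study: the function names, the parameter values `alpha=1.0`, `beta=2.0`, `delta=0.5`, and the three example points are all assumptions.

```python
import math

def transition_probabilities(current, unvisited, tau, eta, alpha=1.0, beta=2.0):
    """Probability of moving from `current` to each point still to be visited."""
    weights = {b: (tau[current][b] ** alpha) * (eta[current][b] ** beta)
               for b in unvisited}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def update_pheromone(tau, delta_tau, delta=0.5):
    """Evaporate a fraction `delta` of the old pheromone, then add the new deposit."""
    n = len(tau)
    for a in range(n):
        for b in range(n):
            tau[a][b] = (1.0 - delta) * tau[a][b] + delta_tau[a][b]
    return tau

# Three hypothetical nodule centers on a line.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
n = len(points)
dist = [[math.dist(p, q) for q in points] for p in points]
# Heuristic information: reciprocal of the distance (0 on the diagonal).
eta = [[1.0 / dist[a][b] if a != b else 0.0 for b in range(n)] for a in range(n)]
tau = [[1.0] * n for _ in range(n)]   # uniform initial pheromone

probs = transition_probabilities(0, [1, 2], tau, eta)
print(probs)
```

With uniform pheromone, the choice is driven by the heuristic term alone, so the nearer point receives the larger probability; as pheromone accumulates on short paths, the first factor increasingly dominates.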

6.2. Enhanced Elite Strategy Ant Colony Algorithm

The ant colony optimization algorithm continues to encounter challenges related to its relatively slow speed in identifying the shortest path. To address this issue and to improve the efficiency of manipulator grasping, we propose an enhanced elite strategy ant colony algorithm [29,30]. This elite strategy augments the pheromone levels within the framework of the ant colony optimization algorithm. The fundamental principle underlying this approach is that when an ant identifies the optimal path, the pheromone concentration along that path is increased, thereby promoting a higher pheromone density on the most favorable route and facilitating quicker convergence. The elite strategy ant colony algorithm incorporates an additional positive feedback component into Equation (10), which is derived from the original remaining pheromone represented in Equation (9).
$$\Delta\tau_{ab}=\sum_{k=1}^{m}\Delta\tau_{ab}^{k} \tag{9}$$
$$\Delta\tau_{ab}=\sum_{k=1}^{m}\Delta\tau_{ab}^{k}+EPFV \tag{10}$$
where $\Delta\tau_{ab}$ is the total residual pheromone, $\Delta\tau_{ab}^{k}$ is the residual pheromone left by ant $k$ between $a$ and $b$, and $EPFV$ (extra positive feedback value) is the additional positive feedback, which can be expressed as Equation (11):
$$EPFV=\begin{cases}\dfrac{c}{L_{bp}}, & (a,b)\in BP\\ 0, & \text{otherwise}\end{cases} \tag{11}$$
where $c$ is the parameter that adjusts the additional feedback value, $L_{bp}$ is the length of the best path, and $BP$ (the best path) is the current best path.
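As a rough illustration of the elite deposit, the sketch below accumulates the ordinary per-ant pheromone and then adds the extra feedback on the edges of the best path. Several details are assumptions made for this example only: each ant deposits `q / L` on its edges (the common ant-cycle convention, not stated in the text), the extra feedback is read as `c / L_bp`, paths are open rather than closed tours, and all names and numbers are hypothetical.

```python
import numpy as np

def tour_length(tour, dist):
    """Length of an open path that visits the points of `tour` in order."""
    return float(sum(dist[a, b] for a, b in zip(tour[:-1], tour[1:])))

def elite_deposit(tours, dist, best_tour, q=1.0, c=2.0):
    """Total residual pheromone: per-ant deposits plus the elite bonus (EPFV)."""
    n = dist.shape[0]
    delta_tau = np.zeros((n, n))
    # Ordinary deposit: every ant reinforces the edges it used (assumed q / L).
    for tour in tours:
        amount = q / tour_length(tour, dist)
        for a, b in zip(tour[:-1], tour[1:]):
            delta_tau[a, b] += amount
            delta_tau[b, a] += amount
    # Elite bonus: only edges on the current best path receive c / L_bp.
    epfv = c / tour_length(best_tour, dist)
    for a, b in zip(best_tour[:-1], best_tour[1:]):
        delta_tau[a, b] += epfv
        delta_tau[b, a] += epfv
    return delta_tau

# Three collinear points, one unit apart (hypothetical example).
dist = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
tours = [[0, 1, 2], [0, 2, 1]]          # lengths 2 and 3
delta_tau = elite_deposit(tours, dist, best_tour=[0, 1, 2])
```

Edges that belong only to the longer path receive far less pheromone than edges on the best path, which is the intended bias toward the favorable route.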
In addition to introducing the elite strategy, we progressively decrease the number of ants and thus obtain an improved elite strategy ant colony algorithm. Provided that local optimal solutions are still avoided, the number of ants decreases round by round, which can be expressed as Equation (12):
$$m_{n}=m_{n-1}-\alpha(n-1) \tag{12}$$
where $n$ ($n = 1, 2, 3, \ldots$) is the index of the round traversed by the ants, $m_{n}$ is the number of ants in round $n$, and $\alpha$ is the attenuation coefficient of the number of ants. The decreasing number of ants helps to further reduce the extraction time of the optimal path and improve the grasping efficiency of the manipulator.
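The round-by-round reduction can be sketched as follows. The recurrence used here, in which round `n` has `alpha * (n - 1)` fewer ants than round `n - 1`, is one reading of the formula above, and the rounding with `math.ceil`, the floor of one ant, and the value `alpha = 0.06` are assumptions chosen for illustration.

```python
import math

def ant_schedule(num_nodules, alpha, ratio=1.3, max_rounds=1000):
    """Number of ants per round until only one ant remains."""
    m = ratio * num_nodules          # initial (real-valued) number of ants
    schedule = []
    for n in range(1, max_rounds + 1):
        if n > 1:
            m -= alpha * (n - 1)     # assumed recurrence: m_n = m_{n-1} - alpha * (n - 1)
        ants = max(1, math.ceil(m))  # assumed rounding; never fewer than one ant
        schedule.append(ants)
        if ants == 1:
            break
    return schedule

# Hypothetical frame with 20 nodules, so the colony starts with 26 ants.
schedule = ant_schedule(num_nodules=20, alpha=0.06)
```

Because the decrement grows with the round index, the colony shrinks slowly at first, when exploration matters most, and quickly later on.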
The initial number of ants should not be set too large or too small. An excessively large value will result in slow convergence of the algorithm, directly impacting the response speed of the manipulator. Conversely, a value that is too small will increase the likelihood of obtaining a local optimal solution. By testing the initial number of ants from small to large, we found that when the initial number of ants is 1.3 times the number of manganese nodules, there will be no local optimal solution (50 tests). At the same time, to ensure clarity and recognition accuracy, the number of manganese nodules in a single frame of a manganese nodule image should not exceed 30. When the quantity of manganese nodules surpasses 30, the quality of the shortest path extraction diminishes, leading to an increased likelihood of encountering local optimal solutions during experimentation. In certain instances, even after more than 1000 iterations, the global optimal solution remains elusive. Concurrently, the blurring of image boundaries intensifies, resulting in inaccuracies in the delineation of the target’s prediction box, which adversely impacts the localization of the central point. This suggests that the optimal number of points for traversal by the ants should be limited to fewer than 30.

6.3. Algorithm Flow and Performance Test

The initial number of ants and the pheromone evaporation coefficient play a pivotal role in the ant colony optimization algorithm. Generally, the initial number of ants is set between 10 and 50, while the pheromone evaporation coefficient typically ranges from 0.1 to 0.9. To shorten traversal time and improve computational efficiency, both the initial number of ants and the pheromone evaporation coefficient should be kept small. However, an excessively small initial number of ants may prolong the search for the shortest path and yield unstable results. Likewise, a very low pheromone evaporation coefficient makes the algorithm tend to retrace previously established paths, increasing the risk of converging on a local optimum. Because the algorithm is highly sensitive to both parameters, they were tuned through extensive experimentation, which indicated that setting the initial number of ants to 1.3 times the number of targets and the pheromone evaporation coefficient to 0.2 substantially decreases computation time while preserving the quality of the solutions obtained.
The initialization parameters are as follows. The initial number of ants is set to 1.3 times the number of target manganese nodules. The algorithm terminates when only one ant remains. The pheromone evaporation coefficient is set to 0.1, and the supplementary feedback value for the residual pheromone is set to 0.2, in accordance with the target quantity. Furthermore, the attenuation coefficient for the number of ants must be chosen so that the termination condition is met when the number of iterations exceeds 30. The flowchart of the enhanced elite ant colony algorithm is presented in Figure 9.
Figure 10 shows the result of the enhanced elite ant colony algorithm. Even with a decreasing number of ants and an increased residual pheromone, the shortest path for collecting the manganese nodules can still be extracted.
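To make the overall flow concrete, the following compact sketch strings the pieces together: path construction, elite pheromone update, and a shrinking colony that stops when one ant remains. It is an illustration under stated assumptions, not the authors' code. The assumptions are: random start points, open paths, a per-ant deposit of `1 / L`, an elite bonus of `c / L_best`, exponents `alpha=1.0` and `beta=2.0`, the decay value `0.06`, and the six example coordinates. The defaults `rho=0.1` and `c=0.2` follow the initialization parameters listed above.

```python
import math
import numpy as np

def path_length(path, dist):
    """Length of an open path that visits `path` in order."""
    return float(sum(dist[a, b] for a, b in zip(path[:-1], path[1:])))

def build_path(tau, eta, start, rng, alpha=1.0, beta=2.0):
    """One ant builds a path, picking each next point by roulette selection."""
    path = [int(start)]
    unvisited = [p for p in range(tau.shape[0]) if p != path[0]]
    while unvisited:
        cur = path[-1]
        w = tau[cur, unvisited] ** alpha * eta[cur, unvisited] ** beta
        pick = int(rng.choice(len(unvisited), p=w / w.sum()))
        path.append(unvisited.pop(pick))
    return path

def elite_aco(points, rho=0.1, c=0.2, ant_ratio=1.3, decay=0.06, seed=0):
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    eta = 1.0 / (dist + np.eye(n))   # identity avoids division by zero on the diagonal
    tau = np.ones((n, n))
    ants = ant_ratio * n             # real-valued colony size, rounded up each round
    best_path, best_len = None, math.inf
    round_no = 1
    while True:
        m = max(1, math.ceil(ants))
        paths = [build_path(tau, eta, rng.integers(n), rng) for _ in range(m)]
        delta = np.zeros((n, n))
        for path in paths:
            length = path_length(path, dist)
            if length < best_len:
                best_path, best_len = path, length
            for a, b in zip(path[:-1], path[1:]):
                delta[a, b] += 1.0 / length
                delta[b, a] += 1.0 / length
        # Elite strategy: extra feedback on the best path found so far.
        for a, b in zip(best_path[:-1], best_path[1:]):
            delta[a, b] += c / best_len
            delta[b, a] += c / best_len
        tau = (1.0 - rho) * tau + delta
        if m == 1:                   # termination: only one ant remains
            return best_path, best_len, round_no
        ants -= decay * round_no     # the next round has fewer ants
        round_no += 1

# Six hypothetical nodule centers (pixel coordinates, made up for illustration).
points = [(10, 10), (40, 15), (25, 50), (70, 60), (55, 20), (80, 30)]
best_path, best_len, rounds = elite_aco(points)
```

The returned path is the best one seen during the run. As with any stochastic search, it is not guaranteed to be the global optimum, so several seeds may be compared in practice.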

7. Experimental Evaluation Methods and Conclusions

7.1. Detection of Manganese Nodules Based on Enhanced YOLOv5

In this study, the detection of manganese nodules is evaluated based on precision ($P$), recall ($R$), average precision ($AP$), and mean average precision ($mAP$). They are calculated according to Equation (13):
$$P=\frac{TP}{TP+FP}\times 100\%,\qquad R=\frac{TP}{TP+FN}\times 100\%,\qquad AP=\int_{0}^{1}p(r)\,dr,\qquad mAP=\frac{1}{m}\sum_{i=1}^{m}AP_{i} \tag{13}$$
where $TP$ is the number of prediction boxes that correctly identify a target position, $FP$ is the number of prediction boxes that identify a wrong position, $FN$ is the number of targets that the model fails to detect, $p(r)$ is the precision as a function of recall, and $m$ is the number of object categories. Table 2 shows the important hyperparameter settings for training the improved YOLOv5 model.
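The metrics above can be computed directly from detection counts and a precision-recall curve, as in the short sketch below. The counts and curve points are hypothetical. The area under the curve is taken with plain trapezoidal integration, which is a simplification and may differ from the interpolation used in YOLOv5's own evaluation code.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (trapezoidal rule)."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)            # integrate from low to high recall
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# Hypothetical counts: 8 correct boxes, 2 wrong boxes, 2 missed nodules.
p, r = precision_recall(tp=8, fp=2, fn=2)

# Hypothetical three-point precision-recall curve.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])

# mAP is the mean of the per-category AP values; with one category it equals AP.
mean_ap = sum([ap]) / 1
```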
To validate the effectiveness of the enhancement strategy, the influence of each improvement on the model was evaluated through ablation experiments, using the AP value as the benchmark. The findings are detailed in Table 3. The data in the table indicate that the improvement strategy employed in this study yields the highest AP value and has the most substantial effect.
In the present dataset, there is a single object category to detect, so mAP is equal to AP. The mAP of the enhanced YOLOv5 model is 74.5%, with a recall rate of 83% and a precision rate of 67.2%, a notable improvement over the pre-enhanced model. To assess the improved YOLOv5 model further, its performance is compared against that of the original model. As illustrated in Figure 11, the validation results for the enhanced YOLOv5 model show lower box regression loss and confidence loss than those of the original model.
Figure 12 compares the AP of the improved model with that of the original model. It can be seen from the figure that the AP of the improved model is higher than that of the original model.
To further examine the influence of illumination on the detection of manganese nodules, we performed a series of comparative experiments by modifying the brightness levels of all images within the dataset to 0.2 times, 0.5 times, and 1.5 times their original values. The experimental data presented in Table 4 below were obtained through training the same model.
The experimental data indicate that the model’s detection accuracy is somewhat influenced by variations in brightness; however, the reduction in AP value is small. The four images in Figure 13 present the prediction results at the different brightness levels. The results show that manganese nodules remain identifiable even when brightness is substantially reduced. This observation suggests that the model has a degree of robustness to fluctuations in brightness.
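The brightness variants used in this comparison can be produced with a simple pixel-wise scaling, sketched below. The text does not state how brightness was modified, so linear scaling of 8-bit pixel values with clipping is an assumption, and the function name and the sample pixel are hypothetical.

```python
import numpy as np

def scale_brightness(image, factor):
    """Multiply pixel intensities by `factor` and clip to the 8-bit range."""
    scaled = image.astype(np.float64) * factor
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

# A single hypothetical RGB pixel stands in for a full seabed image.
image = np.array([[[100, 150, 200]]], dtype=np.uint8)

# The brightness ratios compared in the experiment; 1.0 leaves the image unchanged.
variants = {f: scale_brightness(image, f) for f in (0.2, 0.5, 1.0, 1.5)}
```

Clipping matters for ratios above 1.0: bright regions saturate at 255, so some texture detail is lost rather than scaled.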
Table 5 summarizes the AP, precision, and recall of MobileNetV3, TOOD, YOLOv4, Faster R-CNN, YOLOv7-tiny, YOLOv8, YOLOv5s, and YOLOv5enhanced at a confidence threshold of 0.5. Among these models, only YOLOv8 has a higher AP value than YOLOv5enhanced. However, the YOLOv8 model has a significantly larger number of parameters and higher computational complexity, making it unsuitable for real-time processing applications.

7.2. Path Extraction Based on Improved Elite Ant Colony Algorithm

A total of 15 images, each with labeled coordinates, were randomly selected to assess the enhanced algorithm. The number of manganese nodule targets in these images ranged from 15 to 25. The results of the experiments are presented in Table 6. For each image, ten path extraction experiments were performed, and the highest and lowest values were excluded from the analysis. The average of the remaining values is reported as the number of iterations required for optimal path extraction. The first column of the table gives the number of manganese nodules in each image. The data show that, on average, the improved algorithm requires 24% fewer iteration rounds.
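The averaging procedure described above amounts to a trimmed mean followed by a relative reduction, as in this small sketch. The function names and the sample run counts are hypothetical; the pair 35.3 and 29.9 is the 20-nodule row of Table 6.

```python
def trimmed_mean(values):
    """Mean after discarding one highest and one lowest value."""
    if len(values) < 3:
        raise ValueError("need at least three values to trim both extremes")
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline

# Ten hypothetical iteration counts from repeated runs on one image.
runs = [30, 28, 35, 27, 29, 31, 40, 26, 30, 29]
mean_rounds = trimmed_mean(runs)

# Basic versus improved rounds for the 20-nodule image: roughly a 15% reduction.
reduction = relative_reduction(35.3, 29.9)
```

Dropping the extremes keeps a single unlucky run, for example one trapped in a local optimum for many rounds, from dominating the reported average.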

8. Conclusions

This study presents a vision-based methodology for the extraction of grasping paths for deep-sea manganese nodules, offering a viable solution for manipulators tasked with this objective. The approach incorporates four enhancements to the YOLOv5 model and the ant colony optimization algorithm. Firstly, a dual loss function balances the loss calculations across targets of varying scales. Secondly, the C2f module enhances the flow of gradient information. Thirdly, the number of ants is reduced round by round, and finally, the elite strategy is introduced to speed up pheromone accumulation on the best path. Validation on a custom dataset and in experiments with real seabed photographs indicates a 2.3% improvement in detection accuracy compared to the pre-enhancement model. Furthermore, the average number of iterations required for path extraction is reduced by 24%, alongside a decrease in the duration of individual iterations, demonstrating that the proposed method achieves superior efficiency and accuracy.
Currently, there are few documented cases of underwater manipulator operations, and many uncertainties remain. The objective of this study is to decrease computational time while maintaining recognition accuracy, so as to support underwater operations. However, challenges arise when the number of manganese nodules within a single frame is excessively high, leading to potential issues such as target omission during detection. Furthermore, the extraction of the shortest path may end in a local optimal solution, prolonging computation time. Consequently, future research should explore the integration of path planning methodologies, such as reinforcement learning, with YOLO series algorithms to optimize planning for individual targets. Additionally, multimodal fusion and the incorporation of sensors to support localization and recognition may significantly enhance the success rate of grasping operations. The deployment of the model also warrants careful consideration. Discrepancies between the experimental environment and the actual deployment context can be mitigated through domain adaptation techniques, which adjust the source domain model to better match the data distribution in the target domain. This approach aims to minimize the differences between the two environments and improve the overall performance of the model.

Author Contributions

C.C.: conceptualization, methodology, experimental demonstration, data processing, writing (original draft). P.M.: project management, writing (review and editing), funding acquisition. Q.Z.: methodology, writing (review and editing). G.L.: project management, funding acquisition. Y.X.: project management, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the support of the Laoshan Laboratory (LSKJ202203500), the Taishan Scholars Program of Shandong Province (No. tstp20231215), and China Postdoctoral Science Foundation (No. 2021M693020).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the first author. These data are not publicly available due to the need for future work.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
ACO: ant colony optimization

References

1. Yu, L.; Zhang, P.; Xiao, G.; Zhang, L. Experimental Research on Cobaltic Existence State and Distill of Ocean Manganese Nodule. J. Jilin Univ. (Earth Sci. Ed.) 2012, 39, 1475–1481.
2. Bazeille, S.; Quidu, I.; Jaulin, L. Identification of underwater man-made object using a colour criterion. In Proceedings of the Conference on Detection and Classification of Underwater Targets, Edinburgh, UK, 18–19 September 2007; Volume 29.
3. Chienyao, W.; Phlypo, R.T. A Fully Automated Method to Detect and Segment a Manufactured Object in an Underwater Color Image. EURASIP J. Adv. Signal Process. 2010, 2010, 568092.
4. Girshick, R.; Donahue, J.; Darrell, T. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
5. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083.
6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
7. Zeng, L.; Sun, B.; Zhu, D. Underwater target detection based on Faster R-CNN and adversarial occlusion network. Eng. Appl. Artif. Intell. 2021, 100, 104190.
8. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. arXiv 2022, arXiv:2103.09136.
9. Wang, M.; Liu, M.; Zhang, F.; Lei, G.; Guo, J.; Wang, L. Fast Classification and Detection of Fish Images with YOLOv2. In Proceedings of the 2018 OCEANS-MTS/IEEE Kobe Techno-Oceans (OTO), Kobe, Japan, 28–31 May 2018; pp. 1–4.
10. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
11. Chen, L.; Liu, Z.; Tong, L.; Jiang, Z.; Wang, S.; Dong, J.; Zhou, H. Underwater object detection using Invert Multi-Class Adaboost with deep learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
12. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250.
13. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
14. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. arXiv 2023, arXiv:2309.11331.
15. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13613.
16. Huang, Y.; Yuan, G. AD-DETR: DETR with asymmetrical relation and decoupled attention in crowded scenes. Math. Biosci. Eng. MBE 2023, 20, 14158–14179.
17. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. arXiv 2022, arXiv:2201.12329.
18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640.
19. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
20. Dang, T.L.; Nguyen, G.T.; Cao, T. Object Tracking Using Improved Deep_Sort_YOLOv3 Architecture. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2020; pp. 961–969.
21. Chen, W.; Lu, S.; Liu, B.; Li, G.; Qian, T. Detecting Citrus in Orchard Environment by Using Improved YOLOv4. Sci. Program. 2020, 2020, 8859237.
22. Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914.
23. Mei, Y.; Fan, Y.; Zhang, Y.; Yu, J.; Zhou, Y.; Liu, D.; Fu, Y.; Huang, T.S.; Shi, H. Pyramid Attention Network for Image Restoration. Int. J. Comput. Vis. 2023, 131, 3207–3225.
24. Reis, D.; Kupec, J.; Hong, J.; Ahmad, D. Real-Time Flying Object Detection with YOLOv8. arXiv 2024, arXiv:2305.09972.
25. Wang, J.; Chang, X.; Yang, W. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2022, arXiv:2110.13389.
26. Konatowski, S.; Pawłowski, P. Application of the ACO algorithm for UAV path planning. Prz. Elektrotech. 2019, 95, 117–121.
27. Jie, Z.; Xiuqim, P. Path Planning for Mobile Robots in Complex Environments Based on Improved Ant Colony Algorithm; Springer Nature: Cham, Switzerland, 2022; pp. 3–13.
28. Liu, Z.; Liu, J. Improved ant colony algorithm for path planning of mobile robots based on compound prediction mechanism. J. Intell. Fuzzy Syst. 2022, 44, 1–16.
29. Meng, X.; Zhu, X.; Zhao, J. Obstacle Avoidance Path Planning Using the Elite Ant Colony Algorithm for Parameter Optimization of Unmanned Aerial Vehicles. Arab. J. Sci. Eng. 2023, 48, 2261–2275.
30. Zhang, J.; Yang, J.G.; Qin, W.; Li, H.; Xu, Z.G. An Improved Ant Colony Algorithm Based Dynamic Scheduling Method in Job Shop with Parallel Machines. Adv. Mater. Res. 2012, 628, 304–309.
Figure 1. Grab system position diagram.
Figure 2. Grab system connection diagram.
Figure 3. (a) Initial picture, (b) two-dimensional color histogram, (c) RGB color histogram, (d) texture of manganese nodules, and (e) edge of manganese nodules.
Figure 4. Enhanced network structure.
Figure 5. C2f module structure diagram.
Figure 6. Detection diagram of manganese nodules.
Figure 7. Detection diagram of overlap and masking.
Figure 8. From left to right: the initial picture, the coordinate labeling picture, and the shortest path planning picture.
Figure 9. Enhanced elite ant colony algorithm flow chart.
Figure 10. Path extraction effect diagram.
Figure 11. Loss comparison between the improved model and the original model on the validation set: (a) box regression loss and (b) confidence loss.
Figure 12. The AP comparison between the improved model and the original model.
Figure 13. Brightness detection result comparison for 0.2, 0.5, 1.0, 1.5.
Table 1. Dataset comparison experimental result.
| Comparison Groups | Average Precision/% |
|---|---|
| A | 94.1 |
| B | 36.4 |
| C | 72.2 |
Table 2. Setting of hyperparameters.
| Parameter Name | Parameter Value |
|---|---|
| Depth multiple | 0.5 |
| Width multiple | 0.65 |
| Lr0 | 0.01 |
| Lrf | 0.2 |
| Batch-size | 20 |
Table 3. Results of ablation experiment.
| C2f | NWD | AP/% |
|---|---|---|
| ✕ | ✕ | 72.2 |
|  |  | 73.7 |
|  |  | 72.9 |
| ✓ | ✓ | 74.5 |
“✓” indicates that the current module was used, and “✕” indicates that it was not used.
Table 4. Different brightness training results.
| Brightness Ratio | AP/% |
|---|---|
| 0.2 | 72.5 |
| 0.5 | 74.3 |
| 1.0 | 74.5 |
| 1.5 | 73.7 |
Table 5. Different model training results.
| Algorithm | AP/% | P/% | R/% |
|---|---|---|---|
| MobileNetV3 | 67.2 | 78.7 | 61.7 |
| TOOD | 73.4 | 83.7 | 67.9 |
| YOLOv4 | 68.4 | 80.7 | 61.3 |
| Faster R-CNN | 73.7 | 86.0 | 68.5 |
| YOLOv7-tiny | 73.8 | 83.3 | 67.4 |
| YOLOv8 | 78.1 | 85.7 | 72.3 |
| YOLOv5s | 72.2 | 83.5 | 66.1 |
| YOLOv5enhanced | 74.5 | 85.2 | 67.1 |
Table 6. Path extraction results.
Number of Manganese NodulesOptimal Path LengthBasic RoundsImprove RoundsRelative Error
71211110
815312.31.50
91471320
107523217.40
1115473.530
1298839.68.80
1312934.33.30
1417729.37.80
1618917.57.20
1710,46823.4200
18921917.8140
2014,51235.329.9 1.4 %
22625125.321.50
2412,13022.520.5 2.1 %
2612,09327.619.3 2.6 %

Share and Cite

MDPI and ACS Style

Cui, C.; Ma, P.; Zhang, Q.; Liu, G.; Xie, Y. Grabbing Path Extraction of Deep-Sea Manganese Nodules Based on Improved YOLOv5. J. Mar. Sci. Eng. 2024, 12, 1433. https://doi.org/10.3390/jmse12081433

