Article

BG-YOLO: A Bidirectional-Guided Method for Underwater Object Detection

1 School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
3 College of Computer Science, Chongqing University, Chongqing 400044, China
4 School of Tropical Agriculture and Forestry, Hainan University, Haikou 571158, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(22), 7411; https://doi.org/10.3390/s24227411
Submission received: 11 October 2024 / Revised: 8 November 2024 / Accepted: 12 November 2024 / Published: 20 November 2024
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

Abstract

Degraded underwater images decrease the accuracy of underwater object detection. Existing research uses image enhancement methods to improve the visual quality of images, which may not be beneficial to underwater object detection and can even seriously degrade detector performance. To alleviate this problem, we propose a bidirectional-guided method for underwater object detection, referred to as BG-YOLO. In the proposed method, the network is organized by constructing an image enhancement branch and an object detection branch in a parallel manner. The image enhancement branch consists of a cascade of an image enhancement subnet and an object detection subnet, whereas the object detection branch consists of a detection subnet only. A feature-guided module connects the shallow convolution layers of the two branches. When training the image enhancement branch, the object detection subnet in the enhancement branch guides the image enhancement subnet to be optimized in the direction most conducive to the detection task. The shallow feature maps of the trained image enhancement branch are then fed to the feature-guided module, which constrains the optimization of the object detection branch through a consistency loss and prompts the object detection branch to learn more detailed information about the objects, thereby enhancing detection performance. During detection, only the object detection branch is retained, so no additional computational cost is introduced. Extensive experiments demonstrate that the proposed method significantly improves the detection performance of the YOLOv5s object detection network (mAP increased by up to 2.9%) while maintaining the same inference speed as YOLOv5s (132 fps).

1. Introduction

In underwater scenes, images suffer from wavelength-dependent light absorption and scattering, which results in serious degradation and impairs the accuracy of underwater object detection tasks. Some studies have attempted to address this problem and improve detection accuracy by using underwater image enhancement methods to improve the quality of degraded images. However, image enhancement and object detection have different goals and evaluation indicators, which leads to differences in their optimization directions and optimal solutions [1]. Therefore, adopting image enhancement directly as a pre-processing step may not effectively improve the accuracy of an object detection model [2]. Many researchers have therefore begun to focus on combining underwater image enhancement networks and object detection networks to improve the accuracy of underwater object detection. Liu et al. [3] classified these combinations into three categories: the separate way, the cascaded way, and the parallel way, as shown in Figure 1a–c.
As shown in Figure 1a, the separate method introduces image enhancement as a pre-processing step before object detection. In this method, image enhancement and object detection are regarded as two individual tasks to be optimized separately, and the enhanced images are used as the training dataset for the object detection network. This widely used combination is simple and easy to implement. However, because image enhancement and object detection are two individual tasks with different optimization indicators, adopting image enhancement as a pre-processing step often does not yield the expected improvement in detection performance [2,4].
In the cascaded method (Figure 1b), the image enhancement network and object detection network are integrated into a single pipeline, which ensures that the two individual tasks are optimized in a common direction through joint optimization. In DE-YOLO [5] and in [6], an image enhancement module and a detection module are integrated into a single framework in a cascaded manner, and relevant detection information from the detector is used to guide the optimization of the enhancement module in a direction that benefits the detection task, improving the accuracy of object detection. Organizing the two networks in this cascaded manner improves the performance of the detection tasks but introduces an additional computational cost in the test stage.
Different from the two methods mentioned above, in the parallel method (Figure 1c), the image enhancement network and object detection network are organized in a parallel manner; enhanced images are used to guide the training of the detection network to improve the performance of object detection in various degraded scenes [7,8]. Liu et al. [3] organized the enhancement branch and detection branch in a parallel manner and introduced a feature-guided module that guides the shallow layers of the detection branch to learn the lost details of objects from enhanced images. During the test stage, the enhancement branch and feature-guided module are removed, so, in contrast to the cascaded method, no additional computational cost is introduced. However, Liu et al. adopted an individually trained enhancement network, which could not ensure that the images used for guidance were always conducive to the object detection tasks.
In response to the limitations mentioned above, we attempt to combine the advantages of the cascaded and parallel methods and propose a bidirectional-guided underwater object detection method, referred to as BG-YOLO (Figure 1d). BG-YOLO consists of an image enhancement branch, an object detection branch, and a feature-guided module. Specifically, the image enhancement branch and object detection branch are organized in a parallel manner, and the feature-guided module connects their shallow convolution layers, steering the training of the network by constraining the low-level features of both branches. Our proposed method differs significantly from the parallel method proposed by Liu et al. [3]. First, in Liu et al.’s method, the image enhancement branch uses a separately trained network, which is not always conducive to detection tasks. In contrast, in our proposed method, object detection is used to guide the training of the image enhancement branch, optimizing it in a direction conducive to object detection. Therefore, our method has better generalization ability. Additionally, Liu et al. used enhanced images to constrain the training of the detection branch through a consistency loss. However, the enhanced image output from the enhancement branch is essentially different from the shallow feature map of the detection branch. In contrast, we constrain the training of the detection module using the consistency loss between the low-level features of the two branches.
To summarize, our main contributions are as follows:
  • We propose an object detection framework for underwater degraded scenes, BG-YOLO. Firstly, the detection task is used to guide the training of the image enhancement network, which makes the enhancement network conducive to detection tasks. Subsequently, the image enhancement branch and object detection branch are organized in a parallel manner, and the image enhancement branch is used to guide the training of the object detection branch. Finally, during detection, only the object detection branch is retained; thus, no additional computational cost is introduced.
  • We imposed constraints on the corresponding convolutional layers of the image enhancement branch and object detection branch, which have the same dimensions and similar underlying semantics. This enables the detection branch to learn more feature information, thereby improving its object detection performance.
  • Extensive experiments on URPC2019 and URPC2020 demonstrated that our proposed BG-YOLO significantly improves detection performance compared to the original detection method.
The rest of this article is organized as follows: Section 2 briefly reviews underwater image enhancement and the related existing techniques. Section 3 provides a detailed exposition of the proposed method. Section 4 presents the experiments. The conclusion is presented in Section 5.

2. Related Work

The objective of this research was to improve the performance of object detection in complex underwater environments, which mainly involves techniques relevant to underwater image enhancement and object detection. We first review the existing research findings in the field of underwater image enhancement. Then, we briefly illustrate the progress of object detection techniques in complex underwater environments. Finally, we focus on the joint optimization of underwater image enhancement and object detection tasks.

2.1. Underwater Image Enhancement

Underwater image-processing techniques can overcome the problems of image degradation to a large extent. Underwater image enhancement techniques are classified into traditional approaches and deep learning-based approaches.
Traditional underwater image enhancement approaches include non-physical model-based methods [9,10,11,12,13,14,15,16,17,18] and physical model-based methods [19,20,21]. With the wide application of deep learning in various computer vision tasks, deep learning-based algorithms have been applied to underwater image enhancement, achieving remarkable results [22,23,24,25,26]. Li et al. [27] proposed UWCNN, a convolutional neural network model for underwater image enhancement based on an underwater image prior, which directly restores clear underwater images. Espinosa et al. [28] proposed a variant of U-Net for underwater image enhancement in which the discrete wavelet transform (DWT) is utilized in the skip connections and a channel attention module is used to achieve de-blurring and color correction. Generative adversarial networks (GANs) are also widely applied to underwater image enhancement [29,30,31]. Jiang et al. [32] proposed a domain adaptation framework for real-world underwater image enhancement using CycleGAN [33] to convert underwater-style images into in-air-style images, thereby improving their quality. Image enhancement methods based on deep learning can obtain enhanced images with vivid visual effects without estimating prior parameters but require a large amount of paired data to train the networks.

2.2. Underwater Image Detection

With the development of deep learning technology, object detection algorithms based on deep learning have become widely used in underwater object detection [34,35,36]. Zeng et al. [37] introduced an adversarial occlusion network (AON) into Faster R-CNN, effectively preventing overfitting through adversarial learning and consequently achieving a more robust detection network for underwater object detection. Cao et al. [38] utilized lightweight MobileNetv2 as the backbone network of the SSD algorithm and proposed Faster MSSDLite for underwater detection tasks involving live crabs. Liu et al. [39] utilized YOLOv4 as the backbone network, using a dual-branch structure of a detection branch and tracking branch in a parallel manner to detect and track marine fish in real time. Yu et al. [40] designed an underwater object detection network, U-YOLOv7, based on YOLOv7 to meet both speed and precision requirements. In underwater scenes, the distortion of images is the main factor affecting the performance of object detection. Image enhancement can intuitively improve the visual performance of underwater images. Effectively combining image enhancement algorithms to improve the performance of object detection in low-quality underwater images remains a research objective with significant value.

2.3. Joint Optimization

Recent studies have integrated the tasks of image enhancement and object detection into one end-to-end framework, optimizing both networks jointly during training [1,41,42]. In IA-YOLO [43], a differentiable image-processing module, DIP, is introduced, which uses a small convolutional neural network, CNN-PP, to predict the parameters of DIP and achieve better detection performance through the end-to-end joint learning of CNN-PP and YOLOv3. In DE-YOLO [5], the image enhancement module DENet and the detection module YOLOv3 are organized in a cascaded manner for joint training. In [6], a CycleGAN image enhancement module and SSD detection module are integrated into one framework, and the relevant detection information from the detector is applied to guide the optimization of the enhancement module in a direction conducive to detection tasks.
End-to-end frameworks can enhance the performance of detection tasks but introduce additional computational costs. DSNet [7] utilizes a dual-subnet structure in which the recovery subnet and detection subnet are connected in parallel, sharing a common block. During training, both subnets are trained jointly, but during the detection, only the detection subnet is used. Consequently, no additional computational cost is introduced. In JADSNet [8], a joint attention-guided dual-subnet network is introduced to address the problems in marine object detection through jointly learning image enhancement and object detection tasks. The detection subnet utilizes RetinaNet as the backbone to classify and locate the objects, and the image enhancement subnet shares the feature extraction layer with the detection subnet. Liu et al. [3] organized the detection branch and enhancement branch in a parallel manner, using enhanced images to guide the lower layers of the detection branch to learn lost details. This effectively improved the precision of detection while obtaining enhanced output with excellent visual appearance; however, excellent visual appearance does not guarantee optimal object detection performance.

3. Methods

3.1. Method Overview

The research presented in [3] demonstrates that extracting low-level features is essential for detection in visually degraded scenes. To address the issue of image degradation causing difficulties in object detection in complex underwater scenes, we propose a bidirectional-guided object detection framework, transferring information between image enhancement and object detection. As shown in Figure 2, the framework consists of three parts: an image enhancement branch, an object detection branch, and a feature-guided module. The image enhancement branch and object detection branch are organized in a parallel manner. During the training, the features extracted by the enhancement branch are used to guide the detection branch to learn the low-level features and more detailed object information beneficial to the detection tasks and thereby enhance the detection performance. During the tests, we removed the image enhancement branch and feature-guided module, performing detection with only the trained detection branch. Compared to the method proposed in [6], no additional computational cost is introduced in object detection because there is no cascaded image enhancement network in our proposed framework.
Different from what is proposed in [3], we cascade an image enhancement subnet with an object detection subnet as an image enhancement branch instead of utilizing a pre-trained image enhancement network. When training the image enhancement branch, the detection subnet guides the enhancement subnet towards optimization conducive to object detection tasks. Subsequently, when training the detection branch, the parameters in the image enhancement branch are fixed.
Different from the feature-guided module in [3], our method extracts features from the shallow convolutional layers of the object detection branch and of the backbone of the detection subnet in the image enhancement branch, respectively. Under the constraint of the proposed consistency loss, the low-level features of the object detection branch gradually converge towards the low-level features of the image enhancement branch, which enables the detection branch to extract more detailed information about objects.
It is worth mentioning that, different from other joint optimization methods [3,5,7], the objective of our proposed method is better performance in underwater object detection. The visual appearance of the output of the enhancement branch is not taken into consideration.
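To make the two-stage training flow concrete, the sketch below walks through it with tiny stand-in networks. It is a minimal, runnable illustration only: the real branches are the CycleGAN generator $G_{UA}$ and YOLOv5s, and the placeholder losses are there solely to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEnhancer(nn.Module):
    """Stand-in for the CycleGAN generator G_UA (underwater -> in-air style)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, x):
        return self.net(x)

class TinyDetector(nn.Module):
    """Stand-in for YOLOv5s: returns a shallow feature map plus 'predictions'."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)   # shallow layer used for guidance
        self.head = nn.Conv2d(32, 16, 3, stride=2, padding=1)
    def forward(self, x):
        feat = self.stem(x)
        return feat, self.head(feat)

enh, dsn = TinyEnhancer(), TinyDetector()   # image enhancement branch = enhancer + detection subnet
det = TinyDetector()                        # object detection branch

x = torch.rand(2, 3, 416, 416)              # dummy underwater batch

# Stage 1: jointly optimize the enhancement branch so the enhancer is steered by detection.
opt1 = torch.optim.SGD(list(enh.parameters()) + list(dsn.parameters()), lr=1e-3)
_, pred1 = dsn(enh(x))
pred1.abs().mean().backward()               # placeholder for the real YOLOv5 detection loss
opt1.step()
opt1.zero_grad()

# Stage 2: freeze the enhancement branch and train only the detection branch.
for p in list(enh.parameters()) + list(dsn.parameters()):
    p.requires_grad_(False)
opt2 = torch.optim.SGD(det.parameters(), lr=1e-2)

with torch.no_grad():
    guide_feat, _ = dsn(enh(x))             # shallow features of the frozen enhancement branch
feat, pred = det(x)                         # shallow features and predictions of the detection branch
det_loss = pred.abs().mean()                # placeholder detection loss
consistency = F.mse_loss(feat, guide_feat)  # feature-guided consistency loss
(det_loss + consistency).backward()
opt2.step()
# At inference time only `det` is kept, so no extra computational cost is introduced.
```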

3.2. Image Enhancement Branch

A cascade of the image enhancement subnet and object detection subnet is utilized as the image enhancement branch, which makes the image enhancement subnet more conducive to improving the performance of the object detection subnet. The structure of the image enhancement branch is shown in Figure 3, which consists of an image enhancement subnet and a detection subnet (DSN).
The image enhancement subnet utilizes generators similar to the CycleGAN structure introduced in [33]. CycleGAN is constructed from two generators, $G_{UA}$ and $G_{AU}$, and two adversarial discriminators, $D_U$ and $D_A$. The generator $G_{UA}$ transfers the degraded underwater images $U_{real}$ from the underwater domain to the in-air domain and consequently generates the output images $A_{fake}$. The other generator, $G_{AU}$, transfers in the opposite direction; to be precise, it transfers the images previously mapped to the in-air domain back to the underwater domain, consequently generating the output $U_{rec}$. The adversarial discriminator $D_A$ prompts the generator $G_{UA}$ to transform the original underwater images $U_{real}$ into output that is difficult to distinguish from real clear in-air images $A_{real}$. Similarly, the discriminator $D_U$ prompts the generator $G_{AU}$ to transform real clear in-air images $A_{real}$ into output that is difficult to distinguish from real underwater images $U_{real}$.
DSN is designed to describe the internal features and semantic information necessary for detection tasks, emphasizing more features beneficial for detection and transmitting the perception of the detector to the image enhancement subnet so that the potential output of the image enhancement subnet is more conducive to detection.
With the intention of making the features extracted by the subsequent feature-guided module more comparable, DSN is implemented using YOLOv5s, constituting the image enhancement branch together with the generator $G_{UA}$ of CycleGAN, as shown by the blue arrows in Figure 2 and Figure 3. The pre-trained generator $G_{UA}$ is utilized as the image enhancement module. Then, the DSN is pre-trained with the enhanced data to acquire fundamental knowledge of the target classes. Finally, the whole branch is trained on the original underwater dataset with annotations to achieve appreciable detection performance.
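As a rough illustration of how this cascade can be wired up, a sketch follows. The class is a simplified stand-in, `MyCycleGANGenerator` and `g_ua.pth` are hypothetical names, and the YOLOv5s hub call is the public Ultralytics entry point (left commented out because it requires downloaded weights).

```python
import torch
import torch.nn as nn

class EnhancementBranch(nn.Module):
    """Cascade of the image enhancement subnet (generator G_UA) and the detection subnet (DSN)."""
    def __init__(self, generator: nn.Module, dsn: nn.Module):
        super().__init__()
        self.g_ua = generator   # underwater -> in-air style transfer
        self.dsn = dsn          # detector that steers the generator during joint training
    def forward(self, underwater_img):
        enhanced = self.g_ua(underwater_img)
        return enhanced, self.dsn(enhanced)

# Possible wiring (commented out because it needs external weights / internet access):
# dsn = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
# g_ua = MyCycleGANGenerator()                     # hypothetical generator class
# g_ua.load_state_dict(torch.load('g_ua.pth'))     # hypothetical checkpoint path
# branch = EnhancementBranch(g_ua, dsn)
```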

3.3. Object Detection Branch

In the proposed framework, numerous existing detection networks can be used as the detection branch. The YOLO family of object detectors is widely used in underwater object detection. Some studies have shown that, for specific evaluation indicators and datasets, the performance of newer versions of the YOLO detector is not necessarily better than that of YOLOv5s [44]. YOLOv5s is an effective version for detecting underwater objects [45]. In this study, we selected YOLOv5s as the detection branch. For a detailed description of YOLOv5s, please refer to [46]. In the framework, the image enhancement branch is supposed to guide the optimization of the shallow layers of the backbone network in the object detection branch. As shown by the orange arrows in Figure 2, multiple low-level features at different scales are extracted from the shallow layers of the detection network. These low-level features extracted from the detection branch tend to become consistent with the corresponding low-level features of the enhancement branch under the guidance of the feature-guided module, which makes the features of objects more salient for the detector and is conducive to improving the performance of object detection.
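A small sketch of how such multi-scale shallow features can be tapped with forward hooks is given below. The four-layer stack is only a stand-in for the YOLOv5s backbone, and which modules correspond to conv1/conv2/C3 depends on the concrete implementation used.

```python
import torch
import torch.nn as nn

# Stand-in for the first few layers of a YOLOv5s-style backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 6, stride=2, padding=2),   # ~ conv1
    nn.Conv2d(32, 64, 3, stride=2, padding=1),  # ~ conv2
    nn.Conv2d(64, 64, 1),                       # ~ placeholder for a C3 block
    nn.Conv2d(64, 128, 3, stride=2, padding=1), # deeper layers, not used for guidance
)

features = {}

def save_to(name):
    def hook(_module, _inputs, output):
        features[name] = output                 # cache the layer output for the feature-guided module
    return hook

for name, idx in [("conv1", 0), ("conv2", 1), ("C3", 2)]:
    backbone[idx].register_forward_hook(save_to(name))

_ = backbone(torch.rand(1, 3, 416, 416))
print({k: tuple(v.shape) for k, v in features.items()})
# e.g. {'conv1': (1, 32, 208, 208), 'conv2': (1, 64, 104, 104), 'C3': (1, 64, 104, 104)}
```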

3.4. Feature-Guided Module

In images captured in an underwater environment, some features of objects are distorted or even obscured by the complex background, and the degradation of the images worsens this distortion. Therefore, these objects may be ignored by the network at the beginning. Ensuring that the shallow features of the detection network tend to be consistent with the image enhancement features that are conducive to object detection will make the features of the objects more prominent. Propagating these prominent features to the deeper layers of the object detection network can effectively improve the performance of object detection.
The feature-guided module constrains the low-level features $F$ extracted by the detection subnet of the object detection branch to converge to $I$, the low-level features extracted by the detection subnet of the image enhancement branch. An overview of its structure is shown in Figure 4. The input of the feature-guided module originates from two sources: the multi-level shallow feature mappings $F_l$, $l \in \{1, 2, 3\}$, extracted by the detection subnet of the object detection branch, and the multi-level shallow feature mappings $I_l$, $l \in \{1, 2, 3\}$, extracted by the detection subnet of the image enhancement branch. $F_l$ and $I_l$ are extracted from corresponding layers of the two detection subnets; they therefore have the same dimensions and similar semantic information. Subsequently, the consistency loss is minimized to make the feature mappings $F_l$ and $I_l$ tend towards consistency. The image enhancement branch is fixed during this training stage, constraining the shallow layers of the detection subnet to be optimized towards the features of the enhancement branch and thereby obtaining more detailed information about the objects.

3.5. Loss Function

3.5.1. Image Enhancement Loss

The image enhancement subnet is designed to generate clear in-air images from the input degraded underwater images. CycleGAN has two generators, $G_{UA}$ and $G_{AU}$, and two corresponding adversarial discriminators, $D_U$ and $D_A$. For the mapping $G_{UA}$ and its corresponding discriminator $D_A$, the adversarial loss can be mathematically expressed as follows:
$$\mathcal{L}_{GAN}(G_{UA}, D_A, X_U, X_A) = \mathbb{E}_{x_a \sim p_{data}(x_a)}[\log D_A(x_a)] + \mathbb{E}_{x_u \sim p_{data}(x_u)}[\log(1 - D_A(G_{UA}(x_u)))]$$
where the mapping $G_{UA}$ is expected to generate images $G_{UA}(x_u)$ similar to images in the in-air domain $X_A$ from images $x_u$ in the underwater domain $X_U$, and the discriminator $D_A$ is expected to distinguish between the generated in-air images $G_{UA}(x_u)$ and the real in-air images $x_a$. Similarly, for the mapping $G_{AU}$ and its corresponding discriminator $D_U$, the adversarial loss can be mathematically expressed as follows:
$$\mathcal{L}_{GAN}(G_{AU}, D_U, X_A, X_U) = \mathbb{E}_{x_u \sim p_{data}(x_u)}[\log D_U(x_u)] + \mathbb{E}_{x_a \sim p_{data}(x_a)}[\log(1 - D_U(G_{AU}(x_a)))]$$
The cycle consistency loss prevents the learned mappings $G_{UA}$ and $G_{AU}$ from being contradictory. For each image $x_u$ from the domain $X_U$, the forward cycle of image transfer is supposed to bring it back to the original image, i.e., $G_{AU}(G_{UA}(x_u)) \approx x_u$; similarly, for each image $x_a$ from the domain $X_A$, the backward cycle should satisfy $G_{UA}(G_{AU}(x_a)) \approx x_a$. The cycle consistency loss [32] can be expressed as follows:
$$\mathcal{L}_{con}(G_{UA}, G_{AU}) = \mathbb{E}_{x_u \sim p_{data}(x_u)}\big[\|G_{AU}(G_{UA}(x_u)) - x_u\|_1\big] + \mathbb{E}_{x_a \sim p_{data}(x_a)}\big[\|G_{UA}(G_{AU}(x_a)) - x_a\|_1\big]$$
In addition, the cycle perceptual consistency loss [47] is introduced to maintain the composition of the original image by extracting both high- and low-level features using VGG-16. The cycle perceptual consistency loss can be mathematically expressed as follows:
$$\mathcal{L}_{cp}(G_{UA}, G_{AU}) = \|\phi(x_u) - \phi(G_{AU}(G_{UA}(x_u)))\|_2^2 + \|\phi(x_a) - \phi(G_{UA}(G_{AU}(x_a)))\|_2^2$$
where $\phi(\cdot)$ denotes the features extracted by VGG-16.
The total loss of image enhancement, $\mathcal{L}_{UIE}$, can be expressed as follows:
$$\mathcal{L}_{UIE}(G_{UA}, G_{AU}, D_A, D_U) = \mathcal{L}_{GAN}(G_{UA}, D_A, X_U, X_A) + \mathcal{L}_{GAN}(G_{AU}, D_U, X_A, X_U) + \lambda_1 \mathcal{L}_{con}(G_{UA}, G_{AU}) + \lambda_2 \mathcal{L}_{cp}(G_{UA}, G_{AU})$$
where $\lambda_1$ and $\lambda_2$ control the relative importance of the cycle consistency loss and the cycle perceptual consistency loss, respectively.
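A hedged sketch of how these loss terms can be assembled in PyTorch is given below. The networks (g_ua, g_au, d_a, d_u) and the VGG-16 feature extractor phi are assumed to be defined elsewhere, and the discriminators are assumed to output probabilities so that the log terms reduce to binary cross-entropy.

```python
import torch
import torch.nn.functional as F

def enhancement_loss(x_u, x_a, g_ua, g_au, d_a, d_u, phi, lam1=1.0, lam2=1.0):
    """Generator-side enhancement loss: adversarial + cycle consistency + cycle perceptual terms."""
    a_fake, u_fake = g_ua(x_u), g_au(x_a)
    # adversarial terms: the generators try to make the discriminators output 1
    pred_a, pred_u = d_a(a_fake), d_u(u_fake)
    l_gan = F.binary_cross_entropy(pred_a, torch.ones_like(pred_a)) \
          + F.binary_cross_entropy(pred_u, torch.ones_like(pred_u))
    # cycle consistency: U -> A -> U and A -> U -> A should reproduce the inputs
    u_rec, a_rec = g_au(a_fake), g_ua(u_fake)
    l_con = (u_rec - x_u).abs().mean() + (a_rec - x_a).abs().mean()
    # cycle perceptual consistency on VGG-16 features
    l_cp = F.mse_loss(phi(x_u), phi(u_rec)) + F.mse_loss(phi(x_a), phi(a_rec))
    return l_gan + lam1 * l_con + lam2 * l_cp
```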

3.5.2. Object Detection Loss

Both the object detection module of the image enhancement branch and the object detection branch utilize the same structure as YOLOv5s and consequently use the same loss functions as the original YOLOv5s, as shown below:
$$\mathcal{L}_{det} = a \cdot \mathrm{LOSS}_{obj} + b \cdot \mathrm{LOSS}_{loc} + c \cdot \mathrm{LOSS}_{cls}$$
where $\mathrm{LOSS}_{obj}$ is the confidence loss, a function of the probability that an object exists inside the predicted bounding box; $\mathrm{LOSS}_{cls}$ is the classification loss, a binary cross-entropy function of the predicted probability that the object in a bounding box belongs to a given class and the ground truth; and $\mathrm{LOSS}_{loc}$ is the localization loss, which measures the difference between the predicted bounding boxes and the ground truth. $a$, $b$, and $c$ are the corresponding weight factors.
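For illustration only, the weighted sum can be written as the small function below; the default gains of the public YOLOv5 implementation (box ≈ 0.05, cls ≈ 0.5, obj ≈ 1.0) are used purely as example values and are not the weights reported in this paper.

```python
def detection_loss(loss_obj, loss_loc, loss_cls, a=1.0, b=0.05, c=0.5):
    """Weighted sum of the YOLOv5-style confidence, localization, and classification losses."""
    return a * loss_obj + b * loss_loc + c * loss_cls
```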

3.5.3. Consistency Loss

To measure the consistency of the low-level features of the object detection branch and the enhancement branch, we utilize the mean square error (MSE) function as the consistency loss. The MSE is convex and differentiable and is widely used as a metric in regression, pattern recognition, and signal and image processing. We use the MSE to measure the difference between the feature maps at the pixel level and attempt to minimize it. With the consistency loss, the detection network is able to capture the subtle features of the object distribution.
The feature-guided consistency loss is expressed as follows:
$$\mathcal{L}_{con_l} = \frac{1}{h \cdot w} \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \big( F_l(i, j) - I_l(i, j) \big)^2$$
where $h$ and $w$ are the height and width of the feature map, and $F_l(i, j)$ and $I_l(i, j)$ represent the features from the $l$-th level of the detection subnets of the object detection branch and the image enhancement branch, respectively. The full feature-guided loss is mathematically expressed as follows:
$$\mathcal{L}_{FGM} = \mu_1 \mathcal{L}_{con_1} + \mu_2 \mathcal{L}_{con_2} + \mu_3 \mathcal{L}_{con_3}$$
where $\mu_l$, $l \in \{1, 2, 3\}$, is used to balance the consistency losses between the different feature layers $(F_l, I_l)$, $l \in \{1, 2, 3\}$.
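The loss above translates almost directly into code. The sketch below assumes the two branches expose three shallow feature maps with matching shapes and detaches the enhancement-branch features so that the guidance flows only one way.

```python
import torch
import torch.nn.functional as F

def fgm_loss(det_feats, enh_feats, mus=(1.0, 1.0, 1.0)):
    """Weighted sum of per-level MSE consistency losses between F_l and I_l."""
    total = 0.0
    for mu, f_l, i_l in zip(mus, det_feats, enh_feats):
        total = total + mu * F.mse_loss(f_l, i_l.detach())  # detach: the enhancement branch is frozen
    return total

# toy check with random feature maps of matching shapes
det_feats = [torch.rand(1, c, s, s) for c, s in [(32, 208), (64, 104), (64, 104)]]
enh_feats = [torch.rand_like(f) for f in det_feats]
print(fgm_loss(det_feats, enh_feats))
```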

3.5.4. Total Loss Function

The image enhancement branch and object detection branch are trained separately. When training the image enhancement branch, the image enhancement subnet and object detection subnet are first trained separately; thus, the loss functions are, respectively, the image enhancement loss and the object detection loss. Subsequently, only the object detection loss is used when jointly optimizing the image enhancement branch.
When training the object detection branch, the parameters of the image enhancement branch are fixed, and the total loss of the object detection branch is defined as the weighted sum of the detection loss $\mathcal{L}_{det}$ and the consistency loss $\mathcal{L}_{FGM}$, as shown below:
$$\mathcal{L} = \eta_1 \mathcal{L}_{det} + \eta_2 \mathcal{L}_{FGM}$$
where $\eta_1$ and $\eta_2$ are balance factors.

4. Experiments and Discussion

4.1. Datasets

We adopted publicly accessible datasets, Underwater Robot Picking Contest 2019 (URPC2019) and Underwater Robot Picking Contest 2020 (URPC2020), to evaluate the performance of our proposed object detection framework. The URPC datasets are widely used to evaluate the performance of object detection methods in underwater scenes, and can be downloaded from http://www.cnurpc.org, accessed on 11 November 2024.
The URPC2019 dataset contains 3765 images in the training set and 942 in the testing set, covering five classes of underwater targets: echinus, starfish, holothurian, scallop, and waterweeds. The images in the dataset present adverse characteristics including color deviation, blurriness, low contrast, clustered objects, and occlusion.
The URPC2020 dataset includes a training set consisting of 4200 randomly chosen images and a testing set consisting of 800 randomly chosen images. This dataset covers four different kinds of underwater targets: holothurian, echinus, scallop, and starfish.

4.2. Implementation Details

We implemented our framework using Python 3.8.10 and PyTorch 1.10.0, with an Intel Xeon(R) Platinum 8255C CPU @ 2.50 GHz, 43 GB of memory, and an Nvidia GeForce RTX 3090 GPU (Nvidia, Santa Clara, CA, USA).
When training BG-YOLO, first, joint optimization was applied on the image enhancement branch, and then, the pre-trained model was used to guide the training of the object detection branch. The image enhancement subnet is a publicly released CycleGAN network model, and the detection subnet is a publicly released YOLOv5s model.

4.2.1. Training of the Image Enhancement Branch

To train the image enhancement branch, the image enhancement subnet and object detection subnet were first separately trained. The joint optimization of the two followed.
When training the image enhancement branch, the training details in [6,32] were partly referenced. The images for training were taken from the UIEB [48] and EUVP [49] datasets. UIEB contains 890 pairs of images, with each pair consisting of one degraded underwater image and one corresponding clear reference image. EUVP contains 11,435 degraded/clear images with various backgrounds. We randomly chose 400 unpaired degraded/clear images from UIEB and 600 from EUVP as the training set. The images were then scaled to 416 × 416. During the experiments, we found that when we chose clear in-air images as the target domain, the transferred images suffered from obvious artifacts, distortion, and checkerboard effects, severely affecting the performance of the detector. We therefore chose clear underwater images as the target domain. When training the image enhancement subnet, the Adam optimizer was used. We trained the subnet for 50 epochs, with batch size = 2, momentum $\beta_1 = 0.5$, learning rate lr = $1 \times 10^{-4}$, and the loss weights $\lambda_1 = 5 \times 10^{5}$ and $\lambda_2 = 1$.
When training the detection subnet of the image enhancement branch, we chose the optimal configurations according to the test results, referring to the training details in [6]. The URPC2019 and URPC2020 datasets were used. We first used the pre-trained enhancement subnet mentioned above to enhance the images from the URPC datasets, and the enhanced images then served as the input of the detection subnet. The optimizer utilized was SGD. We trained the subnet based on the publicly released pre-trained model yolov5s.pt for 300 epochs, with batch size = 16, momentum = 0.937, learning rate lr = $1 \times 10^{-2}$, lrf = $1 \times 10^{-2}$, and weight decay = $5 \times 10^{-4}$.
When performing joint optimization on the image enhancement branch, we chose the optimal configurations according to the test results, referring to the training details in [6]. The URPC2019 and URPC2020 datasets were used. During training, the pre-trained enhancement subnet model and pre-trained detection subnet model mentioned above were first loaded. The optimizer used was SGD. We trained the branch for 300 epochs with batch size = 16, momentum = 0.937, learning rate lr = $1 \times 10^{-3}$, lrf = $1 \times 10^{-3}$, and weight decay = $5 \times 10^{-4}$.
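The optimizer settings quoted above can be reproduced roughly as in the sketch below; `model` is a placeholder module, and the linear decay from lr to lr·lrf over 300 epochs is one common interpretation of the lrf parameter, not necessarily the exact schedule used here.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3)   # placeholder for the subnet being trained

# Image enhancement subnet: Adam with beta_1 = 0.5 and lr = 1e-4
adam = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))

# Detection subnet / joint stage: SGD with momentum 0.937 and weight decay 5e-4
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.937, weight_decay=5e-4)

# Decay the learning rate linearly from lr to lr * lrf (here lrf = 0.01) over 300 epochs.
epochs, lrf = 300, 0.01
scheduler = torch.optim.lr_scheduler.LambdaLR(
    sgd, lambda epoch: 1.0 - (1.0 - lrf) * epoch / epochs)
```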

4.2.2. Training of the Object Detection Branch

The object detection branch consisted of only one detection subnet. When training, the pre-trained model of the image enhancement branch obtained through joint optimization was loaded. The parameters of the enhancement branch were fixed, and optimization was only applied to the object detection branch. The URPC2019 and URPC2020 datasets were used. The optimizer used was SGD. We trained the branch for 300 epochs with batch size = 16, momentum = 0.937, learning rate lr = $1 \times 10^{-2}$, lrf = $1 \times 10^{-2}$, and weight decay = $5 \times 10^{-4}$. The object detection branch adopted the publicly released pre-trained model yolov5s.pt.

4.2.3. Comparison

To evaluate the performance of our proposed method, we reproduced the corresponding algorithms according to the details in [3,6] for comparison. However, we utilized YOLOv5s as the object detection network in all of these algorithms to control variables and facilitate comparison, which does not affect the final conclusions.

4.3. Evaluation Indices

To comprehensively and objectively evaluate the performance of our proposed method, we used the mean average precision (mAP), recall, precision, F1-score, PR curve, and detection speed (FPS). When testing the detection speed, the batch size was set to 1.
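Detection speed at batch size 1 can be measured along the lines of the sketch below; the convolution is only a placeholder for the trained detector, and the warm-up and (if applicable) GPU synchronization are what keep the timing honest.

```python
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3).eval()     # placeholder for the trained detection branch
x = torch.rand(1, 3, 416, 416)         # batch size 1, as used for the FPS tests

with torch.no_grad():
    for _ in range(10):                # warm-up iterations
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start, runs = time.time(), 100
    for _ in range(runs):
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
print(f"{runs / (time.time() - start):.1f} FPS")
```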

4.4. Visualized Comparison

In this section, we first compare the visualized detection results for the original YOLOv5s, the separate way, the cascaded way, the parallel way, and the proposed BG-YOLO on the URPC2019 dataset. The visualized results in Figure 5 show that our proposed method can better detect clustered and occluded objects in degraded underwater images.
Figure 6 shows the visualized results for four images sampled from the URPC2020 dataset. It can be seen that BG-YOLO shows better detection performance in severely degraded underwater scenes, regardless of whether the objects are small, clustered, or occluded.
The visualized results demonstrate that BG-YOLO achieves the best performance among the approaches mentioned above. This is because our method optimizes the enhancement network in the way most conducive to detection when performing joint optimization, which reduces the dependence of the detection task on the performance of the image enhancement. When guiding the object detection branch, BG-YOLO makes the low-level features of the detection subnet tend towards those most conducive to detection tasks, rather than focusing on the enhanced image itself.

4.5. Quantitative Comparison

4.5.1. Results for URPC2019 Dataset

We utilized the third-party publicly released YOLOv5s as the basic network model. We conducted tests on the URPC2019 dataset, comparing the detection performance of the original YOLOv5s, the separate way, the cascaded way, the parallel way, and BG-YOLO. When testing, we first resized the images to 416 × 416, and the division of the dataset was the same as that of the original dataset, with 3765 images in the training set and 942 in the validation (testing) set. The cascaded way and parallel way are reproductions of the methods proposed in [6] and [3], respectively. The test results are shown in Table 1.
From Table 1, it can be seen that the mAP@0.5 of our proposed BG-YOLO is 78.4%, and the mAP@0.5–0.95 is 44.7%, marking improvements of 3.1% and 2.7%, respectively, compared to the baseline YOLOv5s. Compared to the separate method, cascaded method, and parallel method, BG-YOLO achieved improvements of 3.7% and 5.1%, 1.5% and 1.9%, and 5.3% and 3.0%, respectively. Additionally, the F1-score of BG-YOLO was also the best among the methods mentioned above, which demonstrates that BG-YOLO achieves an optimal balance between detection precision and recall. The analysis of the visualized detection results (Figure 5) shows that the original YOLOv5s achieves a higher recall rate than BG-YOLO; however, its false positive rate in detection is also higher. Additionally, the inference speed of BG-YOLO matches the fastest of these methods, reaching 130 fps, while the speed of the cascaded way is only 42 fps because of the computational cost incurred by combining the image enhancement and object detection networks into one framework.
From Figure 7, it can be seen that the area enclosed by the PR curve of BG-YOLO (purple) and the coordinate axes is the largest, indicating that BG-YOLO achieves the best performance among these methods.

4.5.2. Results for URPC2020 Dataset

We further conducted tests on URPC2020 in the same way as described in the previous section. The test results are shown in Table 2.
As shown in Table 2, the mAP@0.5 of our proposed BG-YOLO is 80.2%, which is 0.7% higher than the baseline, and 5.4% and 4.3% higher than those of the separate way and parallel way, respectively. Though slightly lower than the cascaded way (by 0.2%) in terms of mAP@0.5, BG-YOLO is about three times as fast. The PR curves shown in Figure 8 also demonstrate the performance of BG-YOLO in comparison to the other methods.
It is worth mentioning that the parallel approach reproduced with reference to [3] performed even worse than the original YOLOv5s. This is because we did not adopt a particular enhancement algorithm when reproducing it, which also demonstrates that the performance of the approach in [3] depends on the enhancement algorithm chosen.

4.5.3. Comparison with Other Algorithms

We also compared the performance of BG-YOLO with that of other algorithms on the URPC2019 and URPC2020 datasets. The compared algorithms include YOLOv7, YOLOv8s, the algorithm in [6], and the algorithm in [3]. YOLOv7 and YOLOv8s both use pre-trained models and default training parameters published by third parties. We reproduced the algorithms in [3,6] and trained them according to the experimental details described in the literature. The experimental results are presented in Table 3.
From the test results shown in Table 3, it can be observed that BG-YOLO performed better than all the other algorithms on the URPC2019 dataset, with its mAP@0.5 and mAP@0.5–0.95 scores exceeding those of the compared algorithms by up to 5.3% and 3%, respectively. On the URPC2020 dataset, the proposed method's mAP@0.5 and mAP@0.5–0.95 were better than those of the algorithm in [3] and slightly lower than those of the algorithm in [6], but the inference speed of our method far exceeds that of [6]. The above data indicate that BG-YOLO can significantly improve the performance of object detection in low-quality underwater images. It is notable that, on the URPC2020 dataset, BG-YOLO’s mAP@0.5 and mAP@0.5–0.95 scores were lower than those of YOLOv7 and YOLOv8, which is mainly due to the advantages conferred by more advanced object detection architectures and does not affect the conclusions drawn in this paper.
It is also notable that the detection performance of the parallel organization method, as shown in Table 1, Table 2 and Table 3, was even lower than that of the original YOLOv5s. Upon analyzing the enhanced dataset intended for guidance, it was observed that the enhanced images introduced numerous artifacts, noise, and checkerboard patterns, all of which negatively impacted the performance of the ultimate detection task. This suggests that the detection performance of the algorithm presented in [3] is constrained by the outcomes of image enhancement. In other words, the enhancement branch of BG-YOLO can guide the enhancement subnet to optimize in a direction more conducive to detection tasks, thus compensating for the shortcomings of the enhancement subnet without special consideration of the performance of the enhancement algorithm.

4.5.4. Effects of Feature-Guided Layer

To evaluate the influence of different shallow convolutional layers on the guidance, we tested the effect of guiding the training of BG-YOLO by extracting the features $(F_l, I_l)$, $l \in \{1, 2, 3\}$, from the first convolutional layer conv1, the second convolutional layer conv2, and the C3 module of the enhancement branch and the corresponding layers of the detection network in the detection branch on the URPC2019 dataset. The results are presented in Table 4. In these experiments, the weight factors of the detection loss $\mathcal{L}_{det}$ and consistency loss $\mathcal{L}_{FGM}$ were $\eta_1 = \eta_2 = 1.0$, and the weight factors of $\mathcal{L}_{con_1}$, $\mathcal{L}_{con_2}$, and $\mathcal{L}_{con_3}$ were $\mu_1 = \mu_2 = \mu_3 = 1.0$.
As shown in Table 4, each combination of the feature-guided layers presented above contributes to improving the detection performance, which indicates that guiding the low-level features of the detection network towards those of the image enhancement branch is conducive to detection tasks. The best detection performance is achieved when only the features of the first convolutional layer are used for guidance. In degraded underwater scenes, the main reason for the decline in object detection performance is that the degraded underwater images lack the detailed features conducive to object detection. Additionally, deeper convolutional layers focus more on semantic information, which does not significantly contribute to improving object detection performance.

4.5.5. Effects of Different η

According to formula (9), the values of $\eta_1$ and $\eta_2$ balance the effects of the object detection branch and the enhancement branch. We fixed $\eta_1$ and tested the contribution of the enhancement branch's guidance with different values of $\eta_2$. Considering the results of the previous section, only the shallowest convolutional layer was used for feature guidance. The dataset used in these experiments was URPC2019. The results are presented in Table 5.
The results presented in Table 5 indicate that the guidance is favorable for detection tasks when $\eta_2$ varies from 0.01 to 1.0, and when $\eta_2 = 0.05$, the model reaches the highest mAP@0.5, 78.4%.

5. Conclusions

The degradation of images is a significant challenge for underwater object detection tasks. Using image enhancement methods to improve the visual effects of underwater images is not necessarily conducive to improving detection performance. To address the above issues, we propose a bidirectional-guided object detection method, which combines the advantages of both the cascaded and the parallel organization of the image enhancement and object detection networks. More specifically, we organize the enhancement branch and detection branch in parallel, where the enhancement branch is a cascade of an enhancement subnet and a detection subnet and the detection branch contains only one detection subnet. The feature-guided module guides the low-level features of the detection branch to be optimized towards the corresponding low-level features of the enhancement branch, so that the detection branch learns detailed features that are conducive to object detection in degraded images. During inference, the enhancement branch and feature-guided module are removed, and only the detection branch is used. Extensive experiments demonstrate that our proposed framework can significantly improve the precision of underwater object detection without introducing additional computational costs, which demonstrates the effectiveness of our proposed method.
Underwater images are easily degraded by complex underwater environments, which hinders subsequent object detection tasks. Detection-oriented image quality evaluation and underwater image enhancement therefore require further research to better serve object detection.

Author Contributions

Conceptualization, R.C. and J.Z.; methodology, J.Z.; software, R.Z., X.Y. and R.C.; validation, R.C., R.Z. and X.Y.; formal analysis, X.Y.; investigation, R.C. and X.Y.; resources, X.Y. and R.Z.; data curation, X.Y.; writing—original draft preparation, R.Z., X.Y. and R.C.; writing—review and editing, J.Z. and X.Y.; supervision, J.Z.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hainan province, grant number 623RC449.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, L.; Jiang, Z.; Tong, L.; Liu, Z.; Zhao, A.; Zhang, Q.; Dong, J.; Zhou, H. Perceptual underwater image enhancement with deep learning and physical priors. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3078–3092. [Google Scholar] [CrossRef]
  2. Xu, S.; Zhang, M.; Song, W.; Mei, H.; He, Q.; Liotta, A. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 2023, 527, 204–232. [Google Scholar] [CrossRef]
  3. Liu, H.; Jin, F.; Zeng, H.; Pu, H.; Fan, B. Image enhancement guided object detection in visually degraded scenes. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 14164–14177. [Google Scholar] [CrossRef]
  4. Liu, H.; Song, P.; Ding, R. Towards domain generalization in underwater object detection. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 1971–1975. [Google Scholar]
  5. Qin, Q.; Chang, K.; Huang, M.; Li, G. DENet: Detection-driven enhancement network for object detection under adverse weather conditions. In Proceedings of the Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 2813–2829. [Google Scholar]
  6. Liu, R.; Jiang, Z.; Yang, S.; Fan, X. Twin adversarial contrastive learning for underwater image enhancement and beyond. IEEE Trans. Image Process. 2022, 31, 4922–4936. [Google Scholar] [CrossRef]
  7. Huang, S.C.; Le, T.H.; Jaw, D.W. DSNet: Joint semantic learning for object detection in inclement weather conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2623–2633. [Google Scholar] [CrossRef]
  8. Cheng, N.; Xie, H.; Zhu, X.; Wang, H. Joint image enhancement learning for marine object detection in natural scene. Eng. Appl. Artif. Intell. 2023, 120, 105905. [Google Scholar] [CrossRef]
  9. Vasamsetti, S.; Mittal, N.; Neelapu, B.C.; Sardana, H.K. Wavelet based perspective on variational enhancement technique for underwater imagery. Ocean Eng. 2017, 141, 88–100. [Google Scholar] [CrossRef]
  10. Hummel, R. Image enhancement by histogram transformation. Comput. Graph. Image Process. 1975. [CrossRef]
  11. Ketcham, D.J.; Lowe, R.W.; Weber, J.W. Image enhancement techniques for cockpit displays. Hughes Aircraft Co Culver City Ca Display Systems Lab 1974, 6. [Google Scholar] [CrossRef]
  12. Pizer, S.M.; Amburn, E.P.; Austin, J.D.; Cromartie, R.; Geselowitz, A.; Greer, T.; ter Haar Romeny, B.; Zimmerman, J.B.; Zuiderveld, K. Adaptive histogram equalization and its variations. Comput. Vision Graph. Image Process. 1987, 39, 355–368. [Google Scholar] [CrossRef]
  13. Land, E.H.; McCann, J.J. Lightness and retinex theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef] [PubMed]
  14. Iqbal, K.; Odetayo, M.; James, A.; Salam, R.A.; Talib, A.Z.H. Enhancing the low quality images using unsupervised colour correction method. In Proceedings of the 2010 IEEE International Conference on Systems, Man and Cybernetics, Istanbul, Turkey, 10–13 October 2010; pp. 1703–1709. [Google Scholar]
  15. Huang, D.; Wang, Y.; Song, W.; Sequeira, J.; Mavromatis, S. Shallow-water image enhancement using relative global histogram stretching based on adaptive parameter acquisition. In Proceedings of the MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, 5–7 February 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 453–465. [Google Scholar]
  16. Zhang, S.; Wang, T.; Dong, J.; Yu, H. Underwater image enhancement via extended multi-scale Retinex. Neurocomputing 2017, 245, 1–9. [Google Scholar] [CrossRef]
  17. Liu, K.; Li, X. De-hazing and enhancement method for underwater and low-light images. Multimed. Tools Appl. 2021, 80, 19421–19439. [Google Scholar] [CrossRef]
  18. Zhang, W.; Dong, L.; Xu, W. Retinex-inspired color correction and detail preserved fusion for underwater image enhancement. Comput. Electron. Agric. 2022, 192, 106585. [Google Scholar] [CrossRef]
  19. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar]
  20. Galdran, A.; Pardo, D.; Picón, A.; Alvarez-Gila, A. Automatic red-channel underwater image restoration. J. Vis. Commun. Image Represent. 2015, 26, 132–145. [Google Scholar] [CrossRef]
  21. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 825–830. [Google Scholar]
  22. Guo, P.; Zeng, D.; Tian, Y.; Liu, S.; Liu, H.; Li, D. Multi-scale enhancement fusion for underwater sea cucumber images based on human visual system modelling. Comput. Electron. Agric. 2020, 175, 105608. [Google Scholar] [CrossRef]
  23. Gangisetty, S.; Rai, R.R. FloodNet: Underwater image restoration based on residual dense learning. Signal Process. Image Commun. 2022, 104, 116647. [Google Scholar] [CrossRef]
  24. Xu, S.; Zhang, J.; Qin, X.; Xiao, Y.; Qian, J.; Bo, L.; Zhang, H.; Li, H.; Zhong, Z. Deep retinex decomposition network for underwater image enhancement. Comput. Electr. Eng. 2022, 100, 107822. [Google Scholar] [CrossRef]
  25. Wu, S.; Luo, T.; Jiang, G.; Yu, M.; Xu, H.; Zhu, Z.; Song, Y. A two-stage underwater enhancement network based on structure decomposition and characteristics of underwater imaging. IEEE J. Ocean. Eng. 2021, 46, 1213–1227. [Google Scholar] [CrossRef]
  26. Xue, X.; Hao, Z.; Ma, L.; Wang, Y.; Liu, R. Joint luminance and chrominance learning for underwater image enhancement. IEEE Signal Process. Lett. 2021, 28, 818–822. [Google Scholar] [CrossRef]
  27. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  28. Espinosa, A.R.; McIntosh, D.; Albu, A.B. An efficient approach for underwater image improvement: Deblurring, dehazing, and color correction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 206–215. [Google Scholar]
  29. Zhang, H.; Sun, L.; Wu, L.; Gu, K. DuGAN: An effective framework for underwater image enhancement. IET Image Process. 2021, 15, 2010–2019. [Google Scholar] [CrossRef]
  30. Sun, B.; Mei, Y.; Yan, N.; Chen, Y. UMGAN: Underwater image enhancement network for unpaired image-to-image translation. J. Mar. Sci. Eng. 2023, 11, 447. [Google Scholar] [CrossRef]
  31. Li, J.; Skinner, K.A.; Eustice, R.M.; Johnson-Roberson, M. WaterGAN: Unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot. Autom. Lett. 2017, 3, 387–394. [Google Scholar] [CrossRef]
  32. Jiang, Q.; Zhang, Y.; Bao, F.; Zhao, X.; Zhang, C.; Liu, P. Two-step domain adaptation for underwater image enhancement. Pattern Recognit. 2022, 122, 108324. [Google Scholar] [CrossRef]
  33. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  34. Huang, H.; Zhou, H.; Yang, X.; Zhang, L.; Qi, L.; Zang, A.Y. Faster R-CNN for marine organisms detection and recognition using data augmentation. Neurocomputing 2019, 337, 372–384. [Google Scholar] [CrossRef]
  35. Hu, K.; Lu, F.; Lu, M.; Deng, Z.; Liu, Y. A marine object detection algorithm based on SSD and feature enhancement. Complexity 2020, 2020, 5476142. [Google Scholar] [CrossRef]
  36. Wang, L.; Ye, X.; Xing, H.; Wang, Z.; Li, P. Yolo nano underwater: A fast and compact object detector for embedded device. In Proceedings of the Global Oceans 2020: Singapore–US Gulf Coast, Biloxi, MS, USA, 5–30 October 2020; pp. 1–4. [Google Scholar]
  37. Zeng, L.; Sun, B.; Zhu, D. Underwater target detection based on Faster R-CNN and adversarial occlusion network. Eng. Appl. Artif. Intell. 2021, 100, 104190. [Google Scholar] [CrossRef]
  38. Cao, S.; Zhao, D.; Liu, X.; Sun, Y. Real-time robust detector for underwater live crabs based on deep learning. Comput. Electron. Agric. 2020, 172, 105339. [Google Scholar] [CrossRef]
  39. Liu, T.; Li, P.; Liu, H.; Deng, X.; Liu, H.; Zhai, F. Multi-class fish stock statistics technology based on object classification and tracking algorithm. Ecol. Inform. 2021, 63, 101240. [Google Scholar] [CrossRef]
  40. Yu, G.; Cai, R.; Su, J.; Hou, M.; Deng, R. U-YOLOv7: A network for underwater organism detection. Ecol. Inform. 2023, 75, 102108. [Google Scholar] [CrossRef]
  41. Zhang, X.; Fang, X.; Pan, M.; Yuan, L.; Zhang, Y.; Yuan, M.; Lv, S.; Yu, H. A marine organism detection framework based on the joint optimization of image enhancement and object detection. Sensors 2021, 21, 7205. [Google Scholar] [CrossRef] [PubMed]
  42. Yeh, C.H.; Lin, C.H.; Kang, L.W.; Huang, C.H.; Lin, M.H.; Chang, C.Y.; Wang, C.C. Lightweight deep neural network for joint learning of underwater object detection and color conversion. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6129–6143. [Google Scholar] [CrossRef] [PubMed]
  43. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-adaptive YOLO for object detection in adverse weather conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1792–1800. [Google Scholar]
  44. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  45. Gašparović, B.; Mauša, G.; Rukavina, J.; Lerga, J. Evaluating Yolov5, Yolov6, Yolov7, and Yolov8 in underwater environment: Is there real improvement? In Proceedings of the 2023 8th International Conference on Smart and Sustainable Technologies (SpliTech), Split, Croatia, 20–23 June 2023; pp. 1–4. [Google Scholar]
  46. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J.; Skalski, P.; Hogan, A.; et al. ultralytics/yolov5: v6.0 - YOLOv5n "Nano" models, Roboflow integration, TensorFlow export, OpenCV DNN support. Zenodo 2021. [Google Scholar] [CrossRef]
  47. Engin, D.; Genç, A.; Kemal Ekenel, H. Cycle-dehaze: Enhanced cyclegan for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 825–833. [Google Scholar]
  48. Ancuti, C.; Ancuti, C.O.; Haber, T.; Bekaert, P. Enhancing underwater images and videos by fusion. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 81–88. [Google Scholar]
  49. Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234. [Google Scholar] [CrossRef]
Figure 1. Different ways of combining underwater image enhancement and object detection.
Figure 2. Overview of BG-YOLO framework.
Figure 3. Overview of the image enhancement branch.
Figure 4. Overview of the feature-guided module.
Figure 5. Visualized detection results for different methods on the URPC2019 dataset.
Figure 6. Visualized detection results for different methods on the URPC2020 dataset.
Figure 7. PR curve of the test results for URPC2019.
Figure 8. PR curve of the test results for URPC2020.
Table 1. Test results for URPC2019.

Method | mAP@0.5 (%) | mAP@0.5–0.95 (%) | Recall (%) | F1-Score (%) | Precision (%) | FPS (fps)
YOLOv5s | 75.3 | 42.0 | 72.6 | 74.0 | 74.8 | 130
Separate Way | 74.7 | 39.6 | 68.6 | 75.0 | 83.4 | 130
Cascaded Way | 76.9 | 42.8 | 72.7 | 76.0 | 81.0 | 42
Parallel Way | 73.1 | 41.7 | 66.3 | 72.0 | 78.7 | 130
BG-YOLO | 78.4 | 44.7 | 70.1 | 77.0 | 88.3 | 130
Table 2. Test results for URPC2020.

Method | mAP@0.5 (%) | mAP@0.5–0.95 (%) | Recall (%) | F1-Score (%) | Precision (%) | FPS (fps)
YOLOv5s | 79.5 | 44.9 | 73.8 | 78.0 | 83.7 | 132
Separate Way | 74.8 | 40.3 | 69.1 | 75.0 | 81.1 | 132
Cascaded Way | 80.4 | 44.8 | 75.7 | 79.0 | 82.0 | 42
Parallel Way | 75.9 | 38.4 | 68.1 | 73.0 | 79.7 | 132
BG-YOLO | 80.2 | 44.6 | 75.2 | 79.0 | 82.7 | 132
Table 3. Performance comparison of different algorithms on URPC2019 and URPC2020 datasets.

Method | URPC2019 mAP@0.5 (%) | URPC2019 mAP@0.5–0.95 (%) | URPC2020 mAP@0.5 (%) | URPC2020 mAP@0.5–0.95 (%)
YOLOv5s | 75.3 | 42.0 | 79.5 | 44.9
Algorithm in [6] | 76.9 | 42.8 | 80.4 | 44.8
Algorithm in [3] | 73.1 | 41.7 | 75.9 | 38.4
YOLOv7 | 77.5 | 43.2 | 81.3 | 44.1
YOLOv8 | 77.9 | 45.5 | 81.8 | 47.6
BG-YOLO | 78.4 | 44.7 | 80.2 | 44.6
Table 4. Contribution of feature-guided layers to detection precision; “√” indicates that the corresponding layer is used for guidance.

conv1 | conv2 | C3 | mAP@0.5 (%) | mAP@0.5–0.95 (%)
 |  |  | 75.3 | 42.0
 |  |  | 78.0 | 43.8
 |  |  | 77.7 | 44.2
 |  |  | 77.0 | 43.8
 |  |  | 76.0 | 43.5
 |  |  | 76.8 | 41.6
 |  |  | 76.6 | 43.5
 |  |  | 76.6 | 43.6
Table 5. Detection results with different $\eta_2$.

$\eta_2$ | mAP@0.5 (%) | mAP@0.5–0.95 (%)
1.0 | 78.0 | 43.8
0.5 | 77.4 | 44.6
0.1 | 77.6 | 43.6
0.05 | 78.4 | 44.7
0.01 | 76.6 | 44.8