1. Introduction
The detection and identification of seafloor targets play an extremely important role in underwater search and rescue, marine engineering construction, marine topography and geomorphology measurements, marine resource investigation, etc. However, owing to the complex marine environment, imaging conditions, and measurement means, detection accuracy and efficiency remain relatively low and struggle to meet practical demands, so improving them has become a crucial research hotspot [1,2,3,4,5,6]. Side-scan sonar has gained widespread application in seabed target detection due to its affordability, rapid coverage of large areas, and its independence from underwater visibility, as it relies instead on acoustic imaging. This technology is crucial to locate the remains of aircraft, ships, and individuals; pinpoint underwater pipelines; and detect submerged reefs, ores, and mines [7,8,9,10]. Nevertheless, the current method of identifying underwater targets in side-scan sonar imagery relies heavily on manual inspection, which is not only subjective and time-consuming but also hinders broader adoption, particularly in scenarios requiring real-time detection, such as those featuring autonomous underwater vehicles (AUVs) [11].
With the rapid advancement of computer vision technology and the development of convolutional neural networks, deep learning approaches have become prevalent in the task of seabed target recognition in side-scan sonar images, offering substantial improvements in accuracy and efficiency over traditional methods [12,13,14,15]. Deep learning algorithms for target detection are broadly categorized into two types: two-stage and one-stage models. In practice, however, two-stage models must first generate candidate bounding boxes with a region proposal network and then perform classification and bounding box regression on them; this results in complex structures, large computational volumes, and long computation times, so these models are seldom used in practical engineering applications [16].
The core advantage of single-stage target detection models is, firstly, their speed: by simplifying the detection pipeline and reducing the consumption of computational resources, they enable efficient, near-instantaneous detection, which makes them especially suitable for application scenarios with high real-time requirements. Secondly, their structure is relatively simple and easy to implement and deploy, and fewer hyperparameters need to be adjusted, which reduces the complexity of model tuning [17]. In terms of detection accuracy, single-stage models tend to demonstrate superior performance in specific scenarios, such as small- and dense-target detection. In addition, single-stage models demonstrate great adaptability and flexibility, can be easily adapted to different task requirements, and are easy to integrate with other image-processing systems or platforms.
Although generic target detection models have achieved certain results in the field of natural or remote sensing imagery, target detection based on side-scan sonar images still presents multiple challenges [18,19,20].
(1) Side-scan sonar usually relies on mobile platforms such as AUVs to implement detection, and in current AUV seabed obstacle detection missions, traditional intelligent detection models often suffer from deployment difficulties, high power consumption, and slow processing speeds due to their large number of parameters and high computational complexity [21,22]. These deficiencies limit the real-time performance and efficiency of AUVs in complex marine environments and increase energy consumption, which affects the operating time and stability of AUVs. Therefore, research on lightweight intelligent detection models becomes particularly important.
(2) Although AUVs are equipped with various detection devices, such as forward-looking sonar, side-scan sonar, and optical cameras [23,24], in some specific mission scenarios, in order to survey as large an area of the seafloor as possible in a short period of time, side-scan sonar with a wide detection range is typically used. This means that sparse targets with poor features must be detected from side-scan sonar images alone. Therefore, determining how to make full use of side-scan sonar images to extract more feature information is the key to improving the accuracy of undersea target detection models.
(3) Side-scan sonar continuously sends and receives acoustic signals while the AUV travels, and the collected data are processed and superimposed to form a waterfall image, so the original side-scan sonar image is usually large. If the whole image is directly input into the network, a large amount of fine-grained information is lost, which also leads to many missed detections of small undersea targets (targets smaller than 50 pixels × 50 pixels in the image) [25]. Therefore, how to optimize the detection strategy is also an important aspect of solving the problem of real-time undersea target detection based on side-scan sonar.
To solve the above problems, in this study, we developed a new target detection model, DBnet.
(1) We adopted two kinds of lightweight backbone networks and optimized the neck part of the baseline model. By streamlining the structure, our model presents significantly fewer parameters and less computation while maintaining detection accuracy, achieving a balance between accuracy and speed and meeting the needs of practical engineering applications.
(2) To address the challenge of extracting features from side-scan sonar images containing little valid information, we developed a dual-backbone network structure. The structure uses multiple feature extraction paths to simulate multimodal data fusion, so that even when only side-scan sonar images are used as input, an effect similar to multimodal data fusion can be achieved, improving the diversity of extracted features and the performance of the detection model.
(3) In order to solve the problem of the large size of the original side-scan sonar waterfall map, in this study, we adopted the slicing-aided hyper-inference (SAHI) technique, which splits large images into multiple small images, performs network inference on each separately, and fuses the detection results of all slices.
2. Related Work
In the field of target detection in side-scan sonar images, the study of lightweight target detection models has gradually become a hotspot. Li [26] significantly improved detection speed and accuracy by replacing the backbone network of YOLOv8s with the GhostNet structure; moreover, the lightweight attention mechanism Triplet Attention was introduced to optimize feature extraction, and the ECIoU loss function was employed to improve the convergence and recognition accuracy of the model. Yu [11] proposed a real-time automatic target recognition method (TR-YOLOv5s) combining the Transformer module and YOLOv5s to address the problems of target-sparse and feature-poor side-scan sonar images; the introduced attention mechanism enhances the focus on target features, improving detection accuracy and efficiency. Zhang et al. [27] combined the Swin Transformer with the YOLO framework for marine target detection. This method allows for the extraction of discriminative features under ocean clutter interference, a reduction in computational complexity, and an improvement in target detection accuracy. Huang [28] employed the Dual Segmented Attention (DSA) mechanism, which efficiently extracts target features through the parallel processing of channel and spatial attention and enhances the ability to extract features with weak boundaries. Li [29] combined Spatial Pyramid Pooling (SPP) and Online Dataset Preprocessing (ODP) for underwater target detection in side-scan sonar images. The method overcomes the input image size limitation and improves the feature extraction capability with SPP, while ODP enhances the diversity and complexity of the dataset, thus improving detection accuracy and efficiency. Ji et al. [30] introduced YOLO-TLA, a lightweight model which includes an extra layer specifically designed for detecting small targets and incorporates the C3CrossCovn module with a global attention mechanism in its backbone network. This design reduces technical complexity and parameter count while enhancing the focus on relevant object attributes and filtering out unnecessary information. Zhang et al. [31] applied a combination of the lightweight backbone network MobileNetV2 and depth-separable convolution on top of the YOLOv4 algorithm, which significantly reduces the number of model parameters, and introduced an attention mechanism in the FPN to learn richer features of small targets. Tang et al. [32] presented a multi-scale receptive field convolutional block attention mechanism, known as AMMRF, which leverages feature map position information to precisely capture inter-channel relationships and enhance the learning of ship–background interactions. In their YOLO-SARSI model, they integrated the AMMRF module within the backbone network for feature fusion and omitted the baseline model's intricate feature fusion component, resulting in a significant reduction in both parameter count and computational load.
3. Materials and Methods
The operation flow chart of the actual measurement process, including an AUV carrying side-scan sonar and the proposed algorithm, is shown in Figure 1; it mainly comprises data acquisition and collation followed by real-time target detection with our algorithm, performed in two steps. The side-scan sonar data collected by the AUV are spliced into a waterfall map; then, the image is sliced and processed with SAHI and input into DBnet for the detection of undersea obstacles. Once the AUV detection mission is over, DBnet can be retrained by using the database constructed from the detected target data and the existing data, expanding the sample size in order to improve the detection accuracy of the model.
3.1. DBnet
The DBnet detector presented here is implemented within the YOLOv8 network architecture. The YOLOv8 model consists of four primary components, i.e., input, backbone, neck, and output, each incorporating various modules, such as the Conv module, the C2f module, and the SPPF module [33]. One of the innovations in YOLOv8 is the introduction of the C2f structure, which is pivotal for residual feature learning and allows for the efficient capture of gradient flow information. The backbone, comprising five Conv modules, four C2f structures, and one SPPF structure, is responsible for extracting generic target features. The SPPF structure, located in the final layer of the backbone, collects information from receptive fields of different sizes (5, 9, and 13) through a series of consecutive 5 × 5 max-pooling operations. These feature layers are then combined with their unprocessed counterparts to integrate multi-scale feature information and enhance model performance. The neck segment, positioned between the backbone and the prediction component, is designed to diversify features and bolster model robustness; it incorporates four C2f modules, two Conv modules, and two Upsample operations. Finally, the prediction component serves as the output end of the model and is responsible for delivering the final target detection results.
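To make the chained-pooling idea concrete, the following PyTorch sketch illustrates an SPPF-style block as described above. It is a simplified illustration rather than the exact YOLOv8 implementation (which additionally wraps its convolutions with batch normalization and SiLU activation):

```python
# Minimal SPPF sketch: three chained 5x5 max-pooling ops emulate receptive
# fields of 5, 9, and 13, and the pooled maps are concatenated with the
# unpooled input before a 1x1 fusion convolution.
import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, kernel_size=1)       # channel reduction
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, kernel_size=1)  # fuse 4 branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)       # equivalent receptive field 5
        y2 = self.pool(y1)      # equivalent receptive field 9
        y3 = self.pool(y2)      # equivalent receptive field 13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```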
In this study, we developed the DBnet model to address the problems of existing models, including the difficulty of extracting effective information from feature-poor and target-sparse side-scan sonar images, challenging deployment on mobile platforms with limited computing power such as AUVs, low detection efficiency, and frequent missed detections. For our model, we designed a parallel lightweight two-branch backbone structure, streamlined the original neck part to achieve high efficiency with a simple structure, and adapted the SAHI algorithm to the characteristics of the original side-scan sonar waterfall map, which greatly improves both the accuracy and the efficiency of small-target detection. The network structure is shown in Figure 2.
Firstly, we chose PP-LCNet and GhostNet to form the dual backbone of the proposed model, which enables it to fuse the feature information extracted from different backbone networks even when only one data modality is available. Generally speaking, using different backbone networks allows for the extraction of complementary feature information, and by fusing these features, more comprehensive target information can be captured, thus improving the detection accuracy of the model. Data from a single source are susceptible to noise, occlusion, etc.; fusing such quasi-multimodal features can enhance the robustness of the model through the redundant information among features extracted from different backbones. When the data extracted from one backbone are subject to disturbance, the other backbone can still provide effective information support. In addition, by extracting features through the dual-backbone network, the model can learn more generalized feature representations, which leads to better performance in different background environments. Specifically, as the first feature extraction backbone, we used GhostNet, which is constructed from the Ghost module and consists of six GhostConv layers and four C3Ghost layers with an SPPF module. As the second backbone, we used PP-LCNet, which is built from depth-separable convolution (DepthSepConv) modules: six consecutive layers of 3 × 3 DepthSepConv followed by three layers of 5 × 5 DepthSepConv.
In the neck segment, the multiple upsampling steps of the original structure often introduce adverse effects, such as noise amplification. Therefore, we refrained from using the standard YOLOv8 upsampling scheme to merge small-scale high-level convolutional features with large-scale low-level ones. Instead, we optimized the neck by directly fusing the sufficiently rich feature maps produced by the dual-backbone feature extraction. This approach reduces the model's computational load, simplifies its structure, and contributes to a lighter model. After feature extraction, the outputs of the 5th, 7th, and 10th layers of each backbone are fed into the neck, where the feature maps from corresponding layers are simply fused. It is worth mentioning that we not only streamlined the structure but also used C3Ghost instead of the C2f module of the original YOLOv8 network to further reduce the parameters and computation of the model. A conceptual sketch of this dual-backbone fusion is given below.
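The sketch below is only a conceptual rendering of the fusion scheme described above, not the authors' code: the names GhostBackbone/LCNetBackbone and the 1 × 1 fusion convolutions are illustrative stand-ins (DBnet uses C3Ghost blocks for the per-scale fusion). Each backbone is assumed to return feature maps from three depths, corresponding to the paper's layers 5, 7, and 10.

```python
# Conceptual dual-backbone fusion: matching scales from the two backbones are
# concatenated and fused per scale, with no upsampling in the neck.
import torch
import torch.nn as nn

class DualBackboneFusion(nn.Module):
    def __init__(self, ghost_net: nn.Module, lc_net: nn.Module,
                 channels: list[int]):
        super().__init__()
        self.ghost_net = ghost_net   # branch 1: GhostNet-style backbone
        self.lc_net = lc_net         # branch 2: PP-LCNet-style backbone
        # One lightweight fusion block per scale (stand-in for C3Ghost).
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats_a = self.ghost_net(x)  # e.g., [P3, P4, P5] from backbone 1
        feats_b = self.lc_net(x)     # e.g., [P3, P4, P5] from backbone 2
        # Concatenate matching scales and fuse before the detection head.
        return [f(torch.cat([a, b], dim=1))
                for f, a, b in zip(self.fuse, feats_a, feats_b)]
```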
To further strengthen inference on large images, we employed the SAHI algorithm, which augments the feature information of small targets by leveraging data from image slices. This makes SAHI highly adept at detecting small targets, which often occupy a limited number of pixels in an image and lack the detail that traditional detection methods need to function effectively. SAHI effectively boosts the feature representation of small targets with slicing and weighted fusion techniques, leading to enhanced detection accuracy. The SAHI algorithm slices the waterfall map formed from side-scan sonar data into multiple slices, on which it performs target detection independently. This parallel processing can significantly improve detection efficiency, especially when processing side-scan sonar images or in scenarios with high computational resource requirements. Moreover, by focusing on smaller slices, the SAHI algorithm reduces the consumption of memory and computational resources, which is especially important for devices with limited computational resources and for real-time detection systems.
3.2. SAHI
Distinct from the real-time segmentation methods referenced in [34], target detection necessitates detailed attributes such as the size, shape, and spatial positioning of underwater targets for their precise location and identification. To ensure an underwater target is not divided between images, overlapping coverage between adjacent samples is crucial. For real-time detection, this entails intensive sampling along the track. Each sample is processed with a d × d pixel sliding window, which creates sections of the same size as it moves horizontally across the track. To achieve precise contours and positions through image segmentation or subsequent processing, adjacent blocks (such as P1 and P2 in Figure 3) share a common coverage area. However, an excessively high overlap rate increases the slice count and prolongs inference time, while too low an overlap may result in incomplete target representation. Setting the overlap rate to 20%, meaning the shared area is 20% of the slice size, ensures that the target remains intact and fulfills real-time processing demands. Although this method can improve accuracy in small-target detection, the scale of the sliding slices must still be chosen to balance model accuracy and inference time.
The specific principle is shown in Figure 4: the original image, I (blue box), is first cut into M × N slices (red box) with a certain overlap, denoted by P1, P2, …, PX; then, each slice is resized while maintaining its aspect ratio, and the content of each slice is predicted. As the slice size becomes smaller, the model's detection performance on larger targets decreases. Therefore, to detect larger targets more accurately, non-maximum suppression (NMS) combines the prediction results of the slices with the full-image inference (FI) results of the original image, mapping them back to the original size. In the NMS process, boxes with an IoU higher than the pre-set matching threshold (Tm) are matched, and for each match, detections with confidence below the threshold Td are removed. A minimal sketch of this slicing-and-fusion procedure is given below.
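The following Python sketch illustrates the slicing-and-fusion logic under stated assumptions; it is not the paper's implementation. Here `detect` is a hypothetical callable standing in for DBnet inference on one image (returning xyxy boxes and scores), torchvision's NMS plays the role of the Tm matching step, and `t_det` plays the role of the Td confidence filter.

```python
# Slicing-aided inference sketch: cut the image into d x d windows with 20%
# overlap, run per-slice detection, add a full-image ("FI") pass, shift boxes
# back to full-image coordinates, and fuse everything with NMS.
import torch
from torchvision.ops import nms

def slice_grid(h, w, d=640, overlap=0.2):
    """Yield (x0, y0, x1, y1) windows of size up to d x d with the given overlap."""
    step = int(d * (1 - overlap))
    ys = list(range(0, max(h - d, 0) + 1, step))
    xs = list(range(0, max(w - d, 0) + 1, step))
    if ys[-1] + d < h: ys.append(h - d)   # make the last row reach the border
    if xs[-1] + d < w: xs.append(w - d)   # make the last column reach the border
    for y in ys:
        for x in xs:
            yield x, y, min(x + d, w), min(y + d, h)

def sliced_inference(image, detect, d=640, overlap=0.2, t_match=0.5, t_det=0.25):
    h, w = image.shape[-2:]
    boxes, scores = [], []
    for x0, y0, x1, y1 in slice_grid(h, w, d, overlap):
        b, s = detect(image[..., y0:y1, x0:x1])        # per-slice prediction
        b = b + torch.tensor([x0, y0, x0, y0])         # shift to full-image coords
        boxes.append(b); scores.append(s)
    b_full, s_full = detect(image)                     # full-image ("FI") pass
    boxes.append(b_full); scores.append(s_full)
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, iou_threshold=t_match)   # Tm: merge matched boxes
    boxes, scores = boxes[keep], scores[keep]
    conf = scores > t_det                              # Td: drop weak detections
    return boxes[conf], scores[conf]
```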
3.3. PP-LCNet
PP-LCNet is a lightweight convolutional neural network model with the advantages of high efficiency, low latency, and low computational cost [35]. It outperforms other lightweight models in multiple tasks and enhances detection accuracy while retaining the fast inference speed of the MobileNet [36,37] family, making it well suited to embedded devices and mobile application scenarios.
The YOLOv8 network architecture includes a Darknet-53 backbone with a deep structure, which enhances the model's capacity to represent image features. However, this depth increases computational complexity, resulting in longer model training and inference times. The overall structure of PP-LCNet is shown in Figure 5. Within the module, the stem component employs a 3 × 3 standard convolution for feature extraction, primarily targeting the extraction of low-level features from the input image. The depth-separable convolution operations, comprising depth-wise (DW) and point-wise (PW) convolutions, decrease the number of model parameters and reduce network computation. PP-LCNet, which utilizes locally connected blocks to build an efficient deep neural network, employs DepthSepConv, proposed in MobileNetV1 [14], as its basic module to reduce computational complexity and improve the generalization ability of the network. This module lacks operations such as shortcuts, thereby eliminating the need for additional operations, such as concatenation or element-wise addition, which hinder the model's inference speed without enhancing accuracy, particularly in smaller models. Furthermore, prior research has indicated that mixing convolutional kernel sizes within the same network layer slows down inference; therefore, a uniform kernel size is used per layer, opting for the larger kernel that balances low latency with high accuracy. Notably, it was discovered that substituting the 3 × 3 kernel with a 5 × 5 kernel only in the network's tail achieved nearly the same benefit as replacing kernels throughout the entire network, prompting this substitution only in the tail section. Moreover, an SE module was added at the tail of the network; by introducing an attention mechanism, it dynamically adjusts the importance of different channels, increasing the model's attention to important features, thus enhancing salient features, suppressing unimportant ones, and improving discriminative ability. To improve the inference speed of the network, the activation function in the convolution module adopts Hard-Swish, an approximation of the Swish function, while the Sigmoid function in the SE module is replaced with the less computationally intensive Hard-Sigmoid function, avoiding a large number of exponential operations and improving computational speed. An illustrative DepthSepConv block in this spirit is sketched below.
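The following PyTorch sketch combines the ingredients just described (depth-wise plus point-wise convolution, Hard-Swish activation, and an optional SE branch with Hard-Sigmoid). It is a simplified illustration in the spirit of PP-LCNet, not the released PP-LCNet code:

```python
# DepthSepConv sketch: DW conv (one filter per channel), optional SE
# reweighting, then PW 1x1 conv for channel mixing; Hard-Swish throughout.
import torch
import torch.nn as nn

class SEModule(nn.Module):
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // r, c, 1), nn.Hardsigmoid(),  # cheap Sigmoid stand-in
        )

    def forward(self, x):
        return x * self.excite(self.squeeze(x))  # channel-wise reweighting

class DepthSepConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=1, use_se=False):
        super().__init__()
        self.dw = nn.Sequential(  # depth-wise: one filter per input channel
            nn.Conv2d(c_in, c_in, k, stride, k // 2, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in), nn.Hardswish(),
        )
        self.se = SEModule(c_in) if use_se else nn.Identity()
        self.pw = nn.Sequential(  # point-wise: 1x1 channel mixing
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.Hardswish(),
        )

    def forward(self, x):
        return self.pw(self.se(self.dw(x)))
```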
3.4. GhostNet
The GhostNet network [38] employs a streamlined design centered around multiple Ghost bottlenecks. Each Ghost bottleneck is constructed by using the Ghost module, which divides the convolution operation into two stages: the first stage performs a small number of standard convolution operations, and the second stage generates "ghost" (phantom) feature maps at small cost by applying cheap linear convolution operations to the feature maps obtained in the first stage. Compared with a normal convolutional neural network, the Ghost module requires fewer parameters and less computation while producing output feature maps of the same size.
The core idea of the Ghost module is to obtain the redundant feature maps through cheap linear transformations, thereby improving the computational efficiency of the network. Figure 6a shows the traditional convolutional structure, while the phantom (Ghost) convolution uses depth-wise convolution as a cheaper linear transformation, as shown in Figure 6b, where ϕ denotes the linear transformation. This structure makes each output channel depend only on its corresponding input channel, simulating redundant features on the one hand and significantly reducing the number of parameters and computations on the other. It differs from regular convolution, which directly produces all feature maps: GhostConv first executes a convolution operation that yields fewer feature maps and then applies a cheap transformation to these initial maps to produce both the identity mapping and the additional (ghost) feature maps. A sketch of this module is given below.
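The sketch below follows this description: a primary convolution produces the N/S "intrinsic" maps, and a cheap depth-wise convolution (the linear transformation ϕ) generates the remaining "ghost" maps. It is simplified from the GhostNet paper; normalization and activation layers are omitted for brevity, and C_out is assumed divisible by S:

```python
# Ghost module sketch: ordinary convolution for N/S channels, then a
# depth-wise "cheap" convolution to generate the ghost channels, concatenated.
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, c_in: int, c_out: int, s: int = 2, k: int = 1, d: int = 3):
        super().__init__()
        c_primary = c_out // s                  # N/S intrinsic channels
        c_ghost = c_out - c_primary             # (S-1)*N/S ghost channels
        self.primary = nn.Conv2d(c_in, c_primary, k, padding=k // 2, bias=False)
        self.cheap = nn.Conv2d(c_primary, c_ghost, d, padding=d // 2,
                               groups=c_primary, bias=False)  # depth-wise phi

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)                          # ordinary convolution
        return torch.cat([y, self.cheap(y)], dim=1)  # identity + ghost maps
```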
This method efficiently decreases both the computational load and the number of parameters, as illustrated by the following comparison with the standard approach.
Let us consider an input data tensor with dimensions C × H × W, which represent the input channels, height, and width of the feature map, respectively. Once a convolution operation is executed, the resulting tensor has dimensions N × H′ × W′, which represent the number of output channels and the height and width of the produced feature map, respectively. Considering a typical convolutional kernel size of K and a linearly transformed convolutional kernel size of D, after S transformations, r_s in Equation (1) is the speedup ratio (here, the computational volume is used as an approximation of speedup), which represents the ratio of the number of original convolution operations to the number of computations of the Ghost module, and r_c in Equation (2) is the compression ratio, representing the ratio of the number of parameters of the original convolution operation to the number of parameters of the Ghost module:

$$r_s = \frac{N \cdot H' \cdot W' \cdot C \cdot K^2}{\frac{N}{S} \cdot H' \cdot W' \cdot C \cdot K^2 + (S-1) \cdot \frac{N}{S} \cdot H' \cdot W' \cdot D^2} \approx \frac{S \cdot C}{S + C - 1} \approx S \tag{1}$$

$$r_c = \frac{N \cdot C \cdot K^2}{\frac{N}{S} \cdot C \cdot K^2 + (S-1) \cdot \frac{N}{S} \cdot D^2} \approx \frac{S \cdot C}{S + C - 1} \approx S \tag{2}$$
Above, N/S is the number of output channels produced by the first transformation, and S − 1 appears because the identity mapping does not need to be computed but counts as part of the second transformation. The approximations hold when D ≈ K and S ≪ C; therefore, the Ghost module reduces computation and parameters by a factor of roughly S. A quick numeric check follows.
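As a worked example (values chosen for illustration, not taken from the paper), consider C = 64 input channels, K = D = 3, and S = 2:

```python
# Numeric check of Equations (1)-(2): the H', W', and N factors cancel, and
# with K = D both ratios reduce to the same expression, approaching S.
C, K, D, S = 64, 3, 3, 2
r_s = (C * K**2) / (C * K**2 / S + (S - 1) / S * D**2)
r_c = r_s  # identical form once H', W', and N cancel and K = D
print(f"speedup/compression ratio ≈ {r_s:.2f}")  # ≈ 1.97, close to S = 2
```

That is, with S = 2 the Ghost module needs roughly half the computation and half the parameters of an ordinary convolution producing the same output size.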
Ghost bottlenecks are bottleneck structures that incorporate Ghost modules, as shown in Figure 7. Each Ghost bottleneck comprises two stacked Ghost modules: the first serves to expand the number of channels (the expansion layer), with the ratio between the output and input channel counts defined as the expansion ratio; the second Ghost module then decreases the channel count to match the channels of the shortcut branch.
To reduce the width and height of the feature layer, we configured the Ghost bottlenecks with a stride of 2. In this scenario, additional convolutional layers are included within the bottlenecks; furthermore, in the main part of the bottleneck, the Ghost modules are complemented by a depth-separable convolution with a stride of 2 to achieve significant compression in both the width and height of the feature layer. A sketch of this bottleneck is given below.
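The sketch below renders this bottleneck structure in PyTorch, reusing the GhostModule class sketched earlier in this section. It is a simplified illustration, not GhostNet's released code; the placement of the stride-2 depth-wise convolution between the two Ghost modules follows the common GhostNet formulation:

```python
# Ghost bottleneck sketch: expansion Ghost module, optional stride-2
# depth-wise downsampling, projection Ghost module, plus a shortcut branch.
import torch.nn as nn

class GhostBottleneck(nn.Module):
    def __init__(self, c_in: int, c_mid: int, c_out: int, stride: int = 1):
        super().__init__()
        layers = [GhostModule(c_in, c_mid)]      # expansion layer
        if stride == 2:                          # halve width and height
            layers.append(nn.Conv2d(c_mid, c_mid, 3, stride=2, padding=1,
                                    groups=c_mid, bias=False))
        layers.append(GhostModule(c_mid, c_out)) # match shortcut channels
        self.main = nn.Sequential(*layers)
        # Shortcut: identity when shapes match, else a strided depth-wise
        # plus point-wise projection (a depth-separable convolution).
        if stride == 1 and c_in == c_out:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1,
                          groups=c_in, bias=False),
                nn.Conv2d(c_in, c_out, 1, bias=False),
            )

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```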
5. Conclusions
To address the problems of existing target detection models, namely their many parameters, long computation times, and high computing requirements, we developed DBnet. To make it lightweight, we selected lightweight backbone networks; at the same time, to compensate for the accuracy loss this choice can cause, we designed a two-branch structure, that is, a dual-backbone network for target feature extraction. This enables the model to mine more target feature information from the image and realize efficient feature extraction and fusion even when the input consists of single-modality data. In addition, thanks to this two-branch structure, we only need to fuse the corresponding feature layers in the neck to allow the model to better characterize the target. While maintaining high detection accuracy, the number of parameters and the computational load of the model are greatly reduced, achieving a balance between detection speed and accuracy. Finally, to address the problem that the original side-scan sonar waterfall maps are large and prone to losing detail information when input into the network, we adopted the slicing-aided hyper-inference (SAHI) technique, which splits large images into multiple small images for inference and improves target detection accuracy by fusing the detection results of all slices. Compared with the baseline model, DBnet presents 33% fewer parameters and 31% less computation (GFLOPs) while maintaining accuracy, which is especially important in resource-limited environments. The effectiveness of DBnet is further confirmed by test results on the SSUTD and SCTD, with mAP values improving by 2.3% and 6.6%, respectively. In addition, the lightweight design of DBnet makes it easier to deploy in engineering applications, especially in mission scenarios that require real-time detection capabilities, such as AUV underwater target detection.