1. Introduction
In autonomous underwater exploration tasks, target detection technology based on side-scan sonar (SSS) imagery plays a crucial role [1,2,3]. Traditional methods for target detection in sonar images rely on pixel features [4], greyscale thresholds [5], a priori target information [6], and hand-crafted filters [7]. However, owing to the low quality of marine sonar images, traditional methods struggle to identify reliable pixel features and greyscale thresholds; they also require considerable manual discrimination, making them inefficient. In recent years, deep learning-based approaches have shown significant advantages and have been applied extensively to underwater target detection [8,9]. Nevertheless, the precision of deep learning detection algorithms has often fallen short of practical engineering requirements. The varied spatial characteristics of targets and the pronounced shadow effects in side-scan sonar images pose significant challenges to achieving high detection accuracy. Therefore, investigating how to integrate the unique characteristics of side-scan sonar images into detection algorithms is of paramount importance for improving overall detection performance.
In the domain of target detection, models such as You Only Look Once (YOLO) [10], the Single-Shot MultiBox Detector (SSD) [11], and RetinaNet [12] have demonstrated substantial potential in terms of accuracy. However, owing to the unique attributes of sonar data, relatively little work has focused on detection algorithms tailored specifically to sonar images. Because underwater targets in side-scan sonar images are formed by mapping acoustic signal intensity, they appear only as greyscale bright spots of varying geometric size, and their geometric features are indistinct, which differs significantly from the imaging quality of conventional optical images. Even for the same underwater target, the features displayed in a side-scan sonar image can vary with draught depth, position, and sonar angle. Current research therefore often integrates sonar-specific feature processing into existing detection models to improve their robustness when handling sonar data.
In the domain of sonar image detection, the earliest strategy employed transfer learning, adapting weight models trained on optical images to improve performance on sonar images [13,14,15]. Zacchini et al. [16] used the mask region-based convolutional neural network (Mask-RCNN) to automatically detect and localize targets in forward-looking sonar images; although the network performs only target detection rather than target recognition, it demonstrated the feasibility of deep learning for forward-looking sonar. Yulin et al. [17] proposed a transfer learning-based convolutional neural network for side-scan sonar wreck image recognition; exploiting the characteristics of a side-scan sonar wreck dataset and refining the network with reference to the Visual Geometry Group (VGG) model, their method greatly improves training efficiency and recognition accuracy and converges faster, demonstrating the feasibility of transfer learning for target recognition in side-scan sonar images. To demonstrate the feasibility of transfer learning more fully, Du et al. [18] trained and evaluated four traditional convolutional neural network (CNN) models on a self-built submarine dataset, compared their prediction accuracy and computational performance, and confirmed the efficiency and accuracy of transfer learning for side-scan sonar image recognition. On this basis, to improve the accuracy of transfer learning, Zhang et al. [19] improved traditional k-means clustering: the intersection over union (IoU) value is used as the distance function to cluster the annotation boxes of the forward-looking sonar training set, and an improved CoordConv feature extraction method assigns coordinate information to high-level features, improving the accuracy of the network's detection regression.
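To make the clustering step concrete, the following is a minimal sketch of k-means over annotated box sizes with d = 1 − IoU as the distance function, in the spirit of the approach described above; the function names, parameters, and example data are illustrative assumptions and are not taken from [19].

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between box sizes and anchor sizes, assuming shared top-left corners."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=3, iters=100, seed=0):
    """k-means over annotated box sizes using d = 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = (1.0 - iou_wh(boxes, anchors)).argmin(axis=1)   # nearest anchor per box
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])      # keep empty clusters in place
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

# Widths and heights (in pixels) of hypothetical training-set annotation boxes
wh = np.array([[24, 30], [30, 28], [60, 45], [110, 90], [18, 22], [75, 60]], dtype=float)
print(kmeans_anchors(wh, k=3))
```

Using 1 − IoU rather than Euclidean distance prevents large boxes from dominating the clustering purely because of their pixel scale, which is why IoU-based distances are preferred for anchor selection.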
To address the problem that target features in underwater sonar images are difficult to extract, researchers have adapted networks that perform well on optical data, and many studies have investigated their innovation and application in underwater environments [20,21]. Zhu et al. [22] introduced an underwater unmanned vehicle (UUV) sonar automatic target recognition method in which a CNN extracts target features from sonar images and a support vector machine (SVM) trained on manually annotated data performs the classification, demonstrating strong sonar image feature extraction capability. To improve the efficiency of sonar image feature extraction, Kong et al. [23] proposed YOLOv3-DPFIN, an improved You Only Look Once v3 (YOLOv3) real-time detection algorithm that accurately detects noise-intensive, multi-class sonar targets with minimal time consumption; a dual-path network (DPN) module and a fusion transition module provide efficient feature extraction, and a dense connection scheme improves multi-scale prediction. Li et al. [24] used the YOLO model and introduced an effective feature encoder to enhance the feature maps, applying channel-level sparsity regularization during training to improve inference speed. He et al. [25] proposed a sonar image detection method based on low-rank sparse matrix decomposition, formulating feature extraction and noise suppression as a matrix decomposition problem, extracting targets with an improved robust principal component analysis algorithm, and optimizing with a fast proximal gradient method. Building on these works, to improve the robustness of detection models, Wang et al. [26] proposed a new convolutional neural network that extracts sonar target regions at multiple scales with a depthwise-separable residual module, enhances feature information through multi-channel fusion, and classifies targets with an adaptive supervision function, improving the generalization ability and robustness of side-scan target recognition. Song et al. [27] proposed a sonar target detection method combining a Markov model with a neural network: the neural network fully extracts sonar image information, and a fully convolutional network (FCN) optimized by the Markov model performs sonar target segmentation. In terms of network structure optimization, Fan et al. [28] built a 32-layer residual network to replace ResNet50/101 in Mask-RCNN, reducing the number of network parameters while maintaining detection accuracy, replaced Stochastic Gradient Descent (SGD) with the Adagrad optimizer (which adaptively adjusts the learning rate), and finally used 2500 sonar images for cross-training to validate the accuracy of the network model.
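As an illustration of the CNN-plus-SVM pipeline attributed to Zhu et al. [22], the sketch below uses a pretrained ResNet-18 as a fixed feature extractor and an RBF-kernel SVM as the classifier; the backbone choice, input sizes, and dummy data are assumptions for illustration only and do not reflect the exact networks used in [22].

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Pretrained backbone reused as a fixed feature extractor (transfer learning)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier head, keep the 512-d pooled features
backbone.eval()

@torch.no_grad()
def extract_features(images):       # images: (N, 3, H, W) tensor of sonar patches
    return backbone(images).numpy()

# Dummy stand-ins for manually annotated sonar patches and their labels
train_x = torch.rand(8, 3, 224, 224)
train_y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

clf = SVC(kernel="rbf").fit(extract_features(train_x), train_y)
print(clf.predict(extract_features(torch.rand(2, 3, 224, 224))))
```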
Existing methods often overlook the strong correlation between target shadows and target features. In this paper, we introduce an attention mechanism into side-scan sonar image target detection in order to exploit this correlation fully. Moreover, research on the multi-spatial characteristics of target features is scarce, and real side-scan sonar datasets to support such studies are lacking. Optimizing existing algorithms to address these key issues in side-scan sonar target detection is therefore highly worthwhile. This paper uses a small-sample dataset collected during sea experiments for practical training and introduces attention mechanisms and size scaling factors into the You Only Look Once v7 (YOLOv7) [29] model. Testing is conducted on both mainstream networks and the improved YOLOv7 network using public datasets and the collected dataset, further validating the crucial role of attention mechanisms and size scaling factors in side-scan sonar image target detection.
In summary, this paper makes the following key contributions:
A dataset collected from sea experiments is used for training, addressing the challenge of limited generalizability in model training outcomes.
To enhance feature diversity and improve classification, a Swin-Transformer is integrated into the YOLOv7 backbone. This leverages the transformer’s ability to capture long-range dependencies and hierarchical features, boosting the model’s performance in detection tasks.
The existing Convolutional Block Attention Module (CBAM) structure is integrated into the prediction head of the YOLOv7 model, enhancing detection speed and accuracy for small targets in complex backgrounds (an illustrative sketch of such a block is given after this list).
An innovative scaling factor is introduced into the prediction head to address the problem of variable spatial feature geometry in side-scan sonar images.
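For reference, the following is a minimal PyTorch sketch of a generic CBAM-style block (channel attention followed by spatial attention) of the kind named in the contributions above; it illustrates the standard CBAM structure rather than the exact module integrated into the improved YOLOv7 prediction head.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM-style block: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled channel descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: convolution over channel-wise average and max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca                                           # re-weight channels
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                        # re-weight spatial locations

# Example: refine a feature map of the size typically seen in a detection head
feat = torch.rand(1, 256, 40, 40)
print(CBAM(256)(feat).shape)   # torch.Size([1, 256, 40, 40])
```

In a detection head, such a block is typically inserted before each output layer so that the learned channel and spatial weights can re-emphasize target and shadow regions.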
This paper is organized as follows. Section 1 provides a review of side-scan sonar target detection algorithms, Section 2 introduces the framework of the side-scan sonar target detection model proposed in this paper, Section 3 details the methodology for obtaining the dataset used in the experiments, and Section 4 presents the experimental results. Finally, Section 5 concludes the paper and outlines future work.
5. Conclusions
This paper addresses the challenges of low precision and limited generalization capacity in side-scan sonar target recognition. To this end, we integrated the Swin-Transformer attention mechanism, the CBAM structure, and the scale factor into the YOLOv7 detection model and validated the efficacy of these attention mechanisms on established open datasets; the improved model was then further validated on real-world ocean side-scan sonar image datasets. The enhanced model demonstrates improvements of 9.28% and 8.14% in its two evaluation metrics relative to the original model, effectively addressing real-world ocean engineering challenges. This comprehensive experimental validation establishes the proposed model's accuracy and generalization capability. The approach holds promise for application in UUVs: by enabling automatic detection and identification, it can significantly enhance the efficiency of exploring uncharted targets on the seafloor, thereby improving the intelligence and operational effectiveness of UUVs.
Despite these enhancements, the model still has a complex architecture and a relatively slow detection speed. In future work, we aim to develop a neural network model for side-scan sonar image target detection that balances high precision with a lightweight design, meeting stringent accuracy requirements while remaining efficient to deploy.