1. Introduction
In recent years, with the increasing scale of marine resource development and utilization, underwater optical visual target detection, which has extensive application scenarios, has played an increasingly important role in underwater security [1], marine exploration [2], fish farming [3], and marine ecology [4]. Therefore, achieving underwater autonomous operation through visual target detection using underwater optical images has become a research hotspot in the field of computer vision.
With the rapid development of deep learning frameworks, visual target detection algorithms have shifted from traditional algorithms based on manual feature extraction to detection technologies based on deep neural networks. Detection has evolved from two-stage algorithms that pursue detection precision, such as RCNN (proposed in 2013), Fast RCNN [5], and Faster RCNN [6], to one-stage algorithms, including SSD [7] and YOLO [8,9,10,11,12], that balance detection precision and detection speed. At present, many scholars have applied target detection methods based on deep neural networks to underwater target detection.
Water absorbs and scatters light. Compared with ordinary images, underwater images exhibit low color contrast, a blue–green tone, fog, and other phenomena [13,14]; thus, the imaging quality is poor. In addition, unlike most target detection tasks on land, which can be implemented on general computer platforms, underwater target detection tasks are generally conducted on ROVs, AUVs [15], and other platforms with limited volume or power supply [16] and can therefore only run in embedded systems. In summary, underwater target detection usually faces two major difficulties: serious image degradation and limited computing resources.
When visual target detection methods based on deep learning frameworks are applied to marine organism detection, organisms often overlap and occlude each other because of the clustering effect. In addition, the underwater movement of the carrier bearing the camera is usually limited to a certain extent. Consequently, the marine organisms in underwater images are usually not uniform in scale and size. Some marine organisms are small and unevenly distributed, which aggravates the difficulty of detection. To address the problem of small target detection, Kisantal et al. [17] proposed replication and enhancement for small targets, which increases the number of training samples of small targets by copying and pasting small targets in images many times to improve detection performance. Moreover, Chen et al. [18] balanced the quantity and quality of medium and small targets by scaling and splicing [19]. These existing coping strategies aim to increase the weight of small target samples in the loss function by increasing the number of small targets. However, they cannot fundamentally solve the problem that small targets are difficult to detect because such targets occupy a small proportion of pixels in the image and carry little information.
In addition, visual target detection based on the deep learning framework is a typical data-driven method, which requires a large amount of high-quality image data to train the network. However, obtaining underwater images is difficult, and images containing a specific underwater organism are limited. Therefore, the sample imbalance is significant. In the field of land visual target detection, the proportions of different target quantities in the dataset are usually balanced by undersampling/oversampling the samples or by weighting the loss function of the samples. For example, the online hard example mining (OHEM) algorithm proposed by Abhinav Shrivastava et al. can screen out hard examples for targeted training based on the loss of input samples [20]. Tsung-Yi Lin et al. proposed the loss function Focal Loss [21] to allow the model to focus on difficult and misdetected samples. Existing sample balance methods have alleviated the negative impact of unbalanced sample numbers to a certain extent. However, when these methods are applied to underwater target recognition, the effects are generally poor because the sample imbalance of underwater images is more serious than that on land.
To address the problems of existing methods in underwater organism detection, this study proposes a one-stage marine organism detection framework based on dataset augmentation and the fusion of Convolutional Neural Networks (CNN) and Vision Transformer (ViT). By introducing novel data augmentation methods, this framework can effectively cope with problems such as the small size of underwater organisms and the scarcity of some species of organism samples. In addition, the one-stage improved YOLOv5 model based on CNN-ViT fusion can complete real-time, high-precision underwater organism detection under limited computing resources. The main contributions of this paper are as follows:
(1) To address the low recall and precision of target detection caused by the varying scales and sizes of marine organisms in underwater images and their uneven location distribution, this study proposes a random expansion method for small targets in the training dataset. Without introducing an additional dataset or increasing the number of sample images, the sizes of small targets in the sample images are randomly expanded, which effectively enlarges the training set at the same image count, thereby decreasing the difficulty of detecting small targets and effectively improving the recall and precision of small target detection.
(2) To address the problem that samples of specific targets are limited because of the difficulty of obtaining underwater images, this study proposes a non-overlapping filling method for scarce samples combined with the random expansion of small targets. An efficient one-dimensional search method for the filling space is adopted to fill scarce targets directly into existing sample images while ensuring that they do not overlap with existing targets. Without the help of additional datasets, scarce targets are effectively supplemented in the training stage, thereby alleviating the imbalance of sample numbers in the dataset.
(3) In view of the limited energy of autonomous underwater vehicles and their severely constrained computing resources, this study proposes a lightweight detection model taking YOLOv5 as the baseline. It introduces a ViT module, deformable convolution, and a trident block, and utilizes Focal Loss as the objective function for the classification loss and target confidence loss, to achieve high-precision, real-time detection of marine organisms on resource-constrained underwater mobile platforms.
2. Related Works
2.1. Target Detection Algorithm
Since Alex Krizhevsky et al. proposed AlexNet, based on a convolutional neural network architecture, and won the ILSVRC 2012 challenge, CNN has become the mainstream technology in computer image processing.
In 2014, R. Girshick proposed using features extracted by Region-Convolutional Neural Networks (RCNN) for target detection, after which target detection began to develop at an unprecedented speed. The RCNN series of algorithms, which extract features from target proposal boxes obtained by selective search, are classic two-stage deep learning target detection algorithms.
In 2017, Ashish Vaswani et al. proposed the Transformer architecture based on an encoder–decoder structure, abandoning RNN and CNN architectures. Transformer effectively solves the problems that RNNs cannot process sequences in parallel and CNNs cannot efficiently capture long-distance dependencies [22].
The self-attention mechanism can be introduced to exploit the information inherent in features for attention interaction and to reduce the dependence on external information. To screen out a small amount of important information from a large amount of information and focus on it while ignoring most unimportant information, an attention mechanism combining a spatial attention mechanism and a channel attention mechanism can be adopted.
Transformer has had a significant impact in the field of Natural Language Processing (NLP). Its applications, such as Bidirectional Encoder Representations from Transformers [23], Generative Pre-trained Transformer [24], and Text-To-Text Transfer Transformer [25], have driven this field to its best performance. Transformer's success in NLP has aroused interest in applying it to computer vision because it can integrate global semantic information.
The groundbreaking work of ViT [26] directly applied the Transformer architecture to non-overlapping medium-sized image patches for image classification and achieved impressive accuracy compared with CNN-based work. CNN captures local-range information, whereas ViT captures long-range information; in extracting image features, ViT is a useful supplement to CNN. Their combination can effectively extract the features of targets with evident scale differences, as in Deformable-DETR [27].
Although the best detection accuracy of target detection algorithms is constantly being refreshed, these algorithms perform poorly on underwater AUVs or ROVs. High-precision target detection algorithms are restricted by limited computational power and limited power supply; therefore, practical applications place higher requirements on underwater target detection algorithms.
2.2. Data Augmentation
A data augmentation strategy generates additional data from existing data to improve the performance of machine learning systems, deep learning classification models, target detection models, and so on. The correct use of a data augmentation strategy can effectively improve the accuracy and robustness of a machine learning model. Unlike fine-tuning the neural network structure, a data augmentation strategy only increases the training cost and time; it does not increase the inference cost.
The imbalance of sample numbers is a major difficulty in target detection tasks and can be simply described as follows: some target classes to be detected have few samples, whereas others have many. The problem can be addressed by balancing the proportions of different target quantities in the dataset: increasing the number of samples of under-represented classes (oversampling) and decreasing the number of samples of over-represented classes (undersampling). The OHEM algorithm screens out hard examples based on the loss of input samples, i.e., samples that have a great impact on classification or detection [20]; these screened samples are then used for training with stochastic gradient descent. Tsung-Yi Lin et al. [21] proposed a new loss function, Focal Loss, to increase the weight of difficult samples and allow the model to focus on difficult and misdetected samples. This loss function alleviates the negative impact of sample-number imbalance to a certain extent, but it does not solve the difficulty in essence. Chia-Hung Yeh et al. [28] skillfully addressed the imbalance of small sample numbers by pasting targets from the dataset into underwater background images, which improved model performance. However, this strategy requires enough images of underwater scenes, and obtaining suitable, available images of underwater scenes similar to those in the dataset is difficult.
An excessively small target size is another difficulty in target detection. A small target covers a low proportion of pixels and contains little information, making localization very difficult. In recent years, solutions for small target detection have emerged. Regarding data augmentation, Yu et al. [29] proposed a scale matching strategy: based on the size differences of targets, the whole image is resized and, considering limited GPU resources, cut again to narrow the gap between targets of different sizes and to prevent the information of small and medium targets from being lost in conventional zoom operations. This strategy is effective and feasible but very cumbersome. Aimed at the small area covered by small targets, the lack of diversity in their positions, and the intersection-over-union between the detection box and the ground-truth box being far less than the expected threshold, Kisantal et al. [17] proposed a replication and enhancement method, which increases the number of training samples of small targets by copying and pasting small targets in the image multiple times, thus improving small target detection performance. This strategy is ingenious and consumes no additional GPU resources, but it causes an imbalance in target numbers. Considering that small targets account for a small proportion of the image and contain little information, Chen et al. [19] zoomed and spliced images during training, turning large targets in the dataset into medium targets and medium targets into small targets, improving the quantity and quality of medium/small targets while limiting GPU computation. Notably, Kisantal et al.'s copying strategy and Chen et al.'s zoom-and-splice strategy only increase the weight of small target samples in the loss function by increasing their quantity; they do not completely solve the problem of small target detection and even make small targets smaller and more difficult to detect.
To sum up, underwater target detection is usually hindered by small target sizes, uneven distribution, serious sample imbalance, and other conditions. Most existing methods directly follow sample augmentation and equalization methods developed for land, without considering that underwater images are difficult to obtain and sample sets are extremely scarce. Aimed at the problems arising when existing methods are applied to underwater target detection, this study introduces novel data augmentation methods into model training to effectively address problems such as the small size of underwater organisms and the scarcity of some species of organism samples. Moreover, a lightweight detection model taking YOLOv5 as the baseline is adopted to achieve real-time, high-precision underwater organism detection under limited computing resources.
3. Method
This study proposes a one-stage marine organism detection framework based on dataset augmentation and CNN-ViT fusion. The overall structure of the framework is shown in Figure 1, and its neural network structure is shown in Figure 2. This framework primarily consists of three parts, namely, the data augmentation strategy, the backbone feature extraction module, and the detection head module.
The data augmentation strategy is applied in the training stage of the framework. At this stage, three methods, namely, image enhancement, random expansion of small objects, and non-overlapping filling of scarce samples, are adopted to address the degradation of underwater imaging, the small scale of marine organism targets, and the extreme lack of some organism samples. In the feature extraction module, YOLOv5 is taken as the baseline, and ViT is introduced into the backbone to fuse CNN and ViT, combining the local-range information capture ability of CNN with the long-range information capture ability of ViT. In addition, deformable convolution [30] and the trident block [31] are introduced to extract richer features through a multi-scale receptive field and handle feature extraction for marine organism targets of different shapes and scales. In the detection head module, based on the multi-scale features output by the path aggregation network (PANet) [32], the classification branch outputs the category of the targets in the detection box, and the regression branch maps the target in the detection box back to its position in the original image.
3.1. Data Augmentation
3.1.1. Random Expansion of Small Objects
To improve the precision of small target detection, the pixel size of small targets must be increased as much as possible. The common practice is to directly enlarge the input image. Although this practice can improve the precision of small target detection, it imposes a higher computational load during training. To reduce this computation, we can use the labels of the original dataset to randomly expand the size of small targets in the image and thereby obtain a new image. Consequently, by expanding the size of small targets, the number of images in the training set can be effectively expanded without introducing additional datasets, completing the data augmentation. Reducing the recognition difficulty of hard cases helps train the detector, an idea inspired by Dynamic-RCNN [33].
As shown in Figure 3, when the target to be detected is small, the model can hardly find the target and cannot generate enough effective prediction boxes (Figure 3a). After the size of the small target is slightly enlarged, the target may be detected in the blue prediction box; however, it is incorrectly recognized as a starfish, resulting in a classification error (Figure 3b). When the target is expanded to be large enough, a white prediction box is observed in which the target is detected and correctly determined to be a scallop (Figure 3c).
Based on the abovementioned ideas, this study proposes a data augmentation method based on the random expansion of small targets. The specific practice is as follows: in the training stage, the center positions $(x_i, y_i)$ of the rectangular boxes containing all targets in the image and the widths and heights $(w_i, h_i)$ of these boxes are obtained. The rectangular boxes containing small targets are selected. The positions $(x_i, y_i)$ of the chosen rectangular boxes in the image remain unchanged, and their widths and heights are changed to $(w_i + \Delta w_i, h_i + \Delta h_i)$. Bilinear interpolation is adopted to enlarge the image in the rectangular box and cover the small target at its original position, thereby acquiring a new image. To limit the expansion range of the label image, the hyperparameter $\epsilon$ is set, representing the maximum number of randomly expanded pixels, that is, $0 \le \Delta w_i, \Delta h_i \le \epsilon$. To avoid, as far as possible, the overlap of expanded target label images, which would affect the effectiveness of the targets, the expansion limit $\epsilon$ selected in this study is 5. A randomly expanded image example of small targets is shown in Figure 4. The two echini on the rightmost part of Figure 4 overlap after size expansion; therefore, the upper-limit expansion size $\epsilon$ should be carefully set.
Notably, the random expansion of small objects expands the training set without introducing additional datasets. In addition, it only changes the training strategy; it does not modify the structure of the target detection model, manually set the shape and size of anchor boxes, or increase the weights of difficult samples. Therefore, it only increases the time cost of training; it does not affect the speed of model inference and will not increase the complexity of the algorithm.
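As a concrete illustration, the following is a minimal sketch of the expansion step, assuming OpenCV-style images and boxes given as center/width/height in pixels that lie inside the image; the small-target threshold `small_thresh` is an illustrative choice, not a value fixed by this study:

```python
import random
import cv2

def expand_small_targets(image, boxes, small_thresh=32, max_expand=5):
    """Randomly expand small targets in place. `boxes` holds (cx, cy, w, h)
    in pixels; a target counts as 'small' when both sides fall below
    `small_thresh` (illustrative value)."""
    img_h, img_w = image.shape[:2]
    new_boxes = []
    for cx, cy, w, h in boxes:
        if w >= small_thresh or h >= small_thresh:
            new_boxes.append((cx, cy, w, h))
            continue
        # Draw the random expansion, bounded by max_expand pixels.
        new_w = w + random.randint(0, max_expand)
        new_h = h + random.randint(0, max_expand)
        # Crop the original target patch.
        x0, y0 = int(cx - w / 2), int(cy - h / 2)
        patch = image[y0:y0 + int(h), x0:x0 + int(w)]
        # Enlarge it with bilinear interpolation.
        patch = cv2.resize(patch, (int(new_w), int(new_h)),
                           interpolation=cv2.INTER_LINEAR)
        # Paste back over the original position, keeping the center fixed
        # and clipping to the image border.
        nx0 = max(int(cx - new_w / 2), 0)
        ny0 = max(int(cy - new_h / 2), 0)
        nx1 = min(nx0 + patch.shape[1], img_w)
        ny1 = min(ny0 + patch.shape[0], img_h)
        image[ny0:ny1, nx0:nx1] = patch[:ny1 - ny0, :nx1 - nx0]
        new_boxes.append((cx, cy, new_w, new_h))
    return image, new_boxes
```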
3.1.2. Non-Overlapping Filling of Scarce Samples
Compared with land images, underwater images have limited acquisition channels; therefore, underwater images of specific targets are scarce, and images containing certain underwater organisms are even scarcer. The sample imbalance is thus very significant.
This study proposes a sample-number balancing method that directly fills scarce targets into existing sample images. To avoid overlap between the filled target and existing targets, we need a strategy to rapidly search for a region in which to place the target. In addition, before the training stage, we need to create a scarce target set. The filling process is shown in Figure 5: first, as shown in the red box, an image is selected from the scarce target set. Second, the image passes through the label augmentation pipeline, where the selector randomly chooses a scarce target, which is then flipped and resized. Third, as shown in the blue box, the target is pasted into the input image.
Considering that the image to be filled also contains other targets, the existing targets in the image must not be covered during target filling. Filling a label image into an input image that contains only one target is easy. However, filling the label image without overlapping the original targets in an input image that contains more than one target poses a region search problem in two-dimensional space. When the number of images to be filled is large, the time consumption of a two-dimensional search algorithm becomes unacceptable. To lower the complexity of the filling-space search, the idea of dimensionality reduction is adopted to transform the problem of finding a suitable two-dimensional area into that of finding a suitable one-dimensional interval.
The specific method is shown in Figure 6: when looking for areas in which to fill targets horizontally, as shown in Figure 6a, the rectangular areas demarcated along the horizontal direction by the occupied ordinate intervals (the intervals marked by the red arrows) are unfillable, and the remaining green rectangular areas are fillable; targets filled into these areas cannot overlap the original targets. Similarly, when looking for areas in which to fill targets vertically, as shown in Figure 6b, the rectangular areas demarcated along the vertical direction by the occupied abscissa intervals (the areas marked by the red arrows) are unfillable, and the remaining green rectangular areas are fillable. Consequently, the regions that can hold scarce sample targets in an image with an uncertain target distribution can be obtained very rapidly, greatly reducing the complexity of the filling-area search. Its time complexity is only $O(n \log n)$ (dominated by sorting the occupied intervals), and its spatial complexity is $O(n)$, where $n$ indicates the number of existing targets in the image.
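A sketch of the one-dimensional search for the vertical case is given below (boxes are assumed to be (x0, y0, x1, y1) corner tuples; the horizontal case is symmetric):

```python
def find_vertical_gaps(boxes, img_h, need_h):
    """One-dimensional fillable-interval search along the y-axis: project
    every existing box onto the ordinate, sweep the sorted occupied
    intervals, and return the gaps tall enough for a target of height
    `need_h`."""
    # Project boxes onto the y-axis and sort the occupied intervals.
    intervals = sorted((y0, y1) for (_, y0, _, y1) in boxes)
    gaps, cursor = [], 0
    for y0, y1 in intervals:
        if y0 - cursor >= need_h:          # gap above this interval
            gaps.append((cursor, y0))
        cursor = max(cursor, y1)           # merge overlapping intervals
    if img_h - cursor >= need_h:           # gap below the last interval
        gaps.append((cursor, img_h))
    return gaps
```

Each returned pair (y_start, y_end) is a horizontal band free of existing targets; a scarce target of height `need_h` can then be pasted with its top edge anywhere in [y_start, y_end - need_h].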
If a starfish is taken as a scarce sample, the effect of filling in the vertical (y) direction in an image with multiple echini is shown in Figure 7. As shown in the left subfigure of Figure 7, black boxes identify the targets. The red box is the minimum closure of all black boxes, and filling targets outside the red box is easy; thus, we mainly discuss the area within the red box. With the help of the label augmentation pipeline and the one-dimensional non-overlapping filling-area search method described in this section, we filled several starfish of different sizes along the vertical direction into the left subfigure of Figure 7; the results are shown in the right subfigure of Figure 7.
3.2. Target Detection
The marine organisms to be detected by the proposed framework are mainly echini, holothurians, starfish, and scallops, all of which are included in the images of the URPC2019 dataset [34]. These marine organisms have different colors, shapes, and textures; the underwater environment suffers serious color deviation and dim light; and the distances and angles of image capture vary. In addition, a certain number of starfish and scallops are covered by deposited sediment. All these factors make target detection difficult. Moreover, considering that the proposed detection framework will run on embedded platforms with limited computing resources, such as AUVs [15], strict requirements are placed on the size and complexity of the framework. In view of the above difficulties and requirements, we improve YOLOv5 by introducing a ViT module, deformable convolution, and a trident block, as shown in Figure 8, and by adopting Focal Loss [21] as the loss function for the classification loss and object confidence loss.
3.2.1. Backbone of Feature Extraction Network
YOLOv5 successfully absorbed the idea of MobileNet [35], reducing computation through separable convolutions, and adopts a large number of combinations of conv and C3 modules, as shown in Figure 9. The conv module undertakes deep-level convolution and is composed of multiple single-layer convolutions; the C3 module undertakes pixel-level convolution, primarily contains multiple convolutions with a kernel size of 1, and is built from a residual structure.
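A simplified sketch of these two modules is given below (channel expansion ratios and other YOLOv5 implementation details are omitted):

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv-BN-SiLU block, a sketch of YOLOv5's basic conv module."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual unit used inside C3: two convs plus a shortcut."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = Conv(c, c, 1)
        self.cv2 = Conv(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    """Sketch of the C3 module: two 1x1 branches, one stacking n residual
    bottlenecks, concatenated and fused by a final 1x1 convolution."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_h = c_out // 2
        self.cv1 = Conv(c_in, c_h, 1)
        self.cv2 = Conv(c_in, c_h, 1)
        self.m = nn.Sequential(*(Bottleneck(c_h) for _ in range(n)))
        self.cv3 = Conv(2 * c_h, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```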
Given the successful application of Transformer in NLP, ViT attempts to apply the standard Transformer structure directly to images with the fewest possible modifications to the image classification process. Specifically, in the ViT algorithm, the whole image is divided into small image patches, which are fed into the network as a sequence forming the Transformer input. Image classification training is then performed through supervised learning.
In ViT, the window-based self-attention mechanism reduces the dependence on external information and uses the information inherent in local features as much as possible for attention interaction, obtaining correlations in the spatial dimension. However, the attention information stored between channels is also noteworthy. Channel attention models the correlation among different channels, automatically acquires the importance of each feature channel through network learning, and finally assigns different weight coefficients to each channel to strengthen important features and suppress unimportant ones.
Drawing on the accuracy advantages of ViT in computer vision tasks, we combine the ViT architecture with YOLOv5 (Figure 10). The ViT module extracts the self-attention information of each layer in the feature map; the convolutional module downsamples the input feature map, which reduces its size and improves the abstraction of the features.
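The following is a minimal sketch of such a fusion block; the depth, head count, and feed-forward width are illustrative rather than the exact settings of our model, and positional/window details are omitted:

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """Sketch of a ViT module inserted into a CNN backbone: flatten the
    feature map to a token sequence, apply standard Transformer encoder
    layers, then reshape back to a feature map. `channels` must be
    divisible by `num_heads`."""
    def __init__(self, channels, num_heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=2 * channels, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.encoder(tokens)          # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```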
3.2.2. Deformable Convolution
To make the features output by the feature extraction network better adapt to underwater targets of different scales and shapes, adaptive-offset deformable convolution is adopted in this study, as shown in Figure 11. The geometric structure of the convolution kernel in a general convolutional neural network is fixed; thus, its ability to model geometric transformations is limited. Without additional supervision, deformable convolution adjusts its spatial sampling positions through additional offsets learned from the target task. Therefore, compared with a fixed-size convolution kernel, deformable convolution can improve detection performance at only a slight increase in computational load.
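A sketch using the deformable convolution operator available in torchvision is shown below; predicting the offsets with a plain convolution is a common construction, not necessarily identical to our exact implementation:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """Sketch of adaptive-offset deformable convolution: a plain conv
    predicts per-location sampling offsets, which DeformConv2d then uses
    to sample the input at learned, non-grid positions."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # Two offsets (dx, dy) per kernel element at each output location.
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))
```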
3.2.3. Trident Block
According to the previous discussion of the underwater dataset used in this study, the underwater organism targets have large scale differences, which may be the main reason for the low recall rate of the detection model. To improve the recall rate, the number and diversity of recommendation boxes must be increased. Thus, different dilation rates are adopted in the C3 module to support multi-scale target detection with a multi-scale receptive field. In addition, inspired by TridentNet, parameter sharing is adopted across the multiple convolution branches to obtain the C3 trident block, as shown in Figure 12. It has various dilation rates, extracts rich features, and reduces the number of model parameters. Although the parameters of the multiple branch convolution kernels are identical, the different dilation rates ensure the diversity of extracted features; therefore, the branches can detect targets within different receptive fields.
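A sketch of the weight-sharing multi-dilation idea follows; averaging the branch outputs is an illustrative fusion choice, and the C3 trident block organizes such branches inside the C3 structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TridentConv(nn.Module):
    """Sketch of weight-sharing trident convolution: one 3x3 kernel is
    applied at several dilation rates, giving multi-scale receptive
    fields without extra parameters."""
    def __init__(self, c_in, c_out, dilations=(1, 2, 3)):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(c_out, c_in, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.dilations = dilations

    def forward(self, x):
        # Same kernel, different dilations; padding=d keeps spatial size.
        outs = [F.conv2d(x, self.weight, padding=d, dilation=d)
                for d in self.dilations]
        return torch.stack(outs).mean(dim=0)
```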
3.2.4. Loss Function
As previously mentioned, to balance the number of samples, Focal Loss [21] is used as the classification loss and object confidence loss to weight the loss of samples that are difficult to classify and the loss of the small number of positive samples. In addition, CIoU Loss is used as the regression loss to produce accurate detection in dense target scenes.
The objective function of Focal Loss is shown in Equation (1):

$$\mathrm{FL}(p_t) = -\alpha_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right), \qquad (1)$$

where $p$ refers to the predicted object score; $p_t = p$ and $\alpha_t = \alpha$ when $q > 0$, while $p_t = 1 - p$ and $\alpha_t = 1 - \alpha$ otherwise; and $q$ specifies the ground-truth class.
For each foreground pixel, the $q$ of the real class is set to the IoU value between the generated prediction box and the real box, whereas the $q$ of each non-real class is set to 0. For each background pixel, the target score $q$ is set to 0. In this study, $\alpha$ and $\gamma$ are hyperparameters, valued at 0.25 and 1.5, respectively.
Focal Loss retains the penalty on the model for misjudging negative samples. Because positive samples are scarcer than negative samples, the numbers of positive and negative samples are unbalanced. Under the supervision of Focal Loss, the model must not only correctly identify positive samples but is also punished for misjudging them. Because high-quality positive samples contribute more to a high average precision (AP) than low-quality ones, the Focal Loss function focuses more on high-quality positive samples during training. The hyperparameter $\alpha$ balances the loss of positive and negative samples, and $\gamma$ lowers the influence of the large number of negative samples on the loss function.
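A sketch of the $\alpha$-balanced Focal Loss with the stated hyperparameters is given below, shown with hard 0/1 targets for brevity (our soft IoU-valued targets $q$ would replace `targets` here):

```python
import torch
import torch.nn.functional as F

def focal_loss(pred_logits, targets, alpha=0.25, gamma=1.5):
    """Alpha-balanced Focal Loss sketch; alpha=0.25 and gamma=1.5 follow
    the values given in the text. `targets` are 0/1 labels."""
    p = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, targets,
                                            reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # p_t from Eq. (1)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```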
In this study, CIoU Loss is used as the boundary box regression loss. IoU represents the ratio of the intersection area to the union area of rectangles $A$ and $B$ and is used to measure the overlap of two intersecting rectangles, as shown in Equation (3):

$$\mathrm{IoU} = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}. \qquad (3)$$
The disadvantage of directly using IoU as the loss function is that IoU = 0 when the prediction box does not overlap with the real box, particularly when they are far apart. In that case, adjusting the position of the prediction box cannot change the IoU, so the gradient is always 0, and finding the correct regression direction through gradient descent is difficult. To compensate for IoU's inability to measure the distance between two disjoint rectangles, Zhaohui Zheng et al. proposed CIoU [36]. The main improvement of CIoU over IoU is that it considers the similarity between the prediction box (green) and the real box (black), as shown in Figure 13.
The expression of CIoU is as follows:

$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha v,$$

where $b$ and $b^{gt}$ refer to the central points of the prediction box and real box, respectively; $\rho(\cdot)$ refers to the Euclidean distance between the two central points; $c$ refers to the diagonal distance of the minimum closure area that contains the prediction box and real box; $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ is used to measure the similarity of the length–width ratios of the prediction box and real box; and $\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$ refers to a weight function.
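For reference, a direct sketch of the CIoU computation for corner-format boxes (following the standard formulation of [36]; the gradient treatment of $\alpha$ is simplified here):

```python
import math
import torch

def ciou(box1, box2, eps=1e-7):
    """CIoU between boxes given as (x0, y0, x1, y1) tensors."""
    # Intersection and union areas.
    iw = (torch.min(box1[..., 2], box2[..., 2])
          - torch.max(box1[..., 0], box2[..., 0])).clamp(0)
    ih = (torch.min(box1[..., 3], box2[..., 3])
          - torch.max(box1[..., 1], box2[..., 1])).clamp(0)
    inter = iw * ih
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # Squared center distance over squared enclosing-box diagonal.
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2
            + (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and its weight alpha.
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps))
                              - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v
```

The regression loss is then $1 - \mathrm{CIoU}$.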
The complete loss function is calculated as follows:

$$L = \lambda_1 \sum_i L_{\mathrm{CIoU}}\left(b_i, b_i^{gt}\right) + \lambda_2 \sum_i \sum_c \mathrm{FL}\left(p_i^c, \hat{p}_i^c\right) + \lambda_3 \sum_i \mathrm{FL}\left(p_i^{obj}, \hat{p}_i^{obj}\right),$$

where $p_i^c$ and $\hat{p}_i^c$ refer to the predicted and target confidence scores of category $c$ at position $i$ in each level of the feature map output by the network; $b_i$ and $b_i^{gt}$, respectively, represent the prediction box and the corresponding real box at position $i$ on the feature map of each level output by the PANet; and the hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ refer to the weights of the boundary box regression loss, the classification loss, and the object confidence (IoU) loss, which are 0.05, 0.5, and 1.0, respectively, in this study.
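As a sketch, the weighted combination is then a one-line computation; the three partial losses are assumed to be accumulated as in the previous listings, and the weights follow the values above:

```python
def total_loss(loss_box, loss_cls, loss_obj,
               lambda_box=0.05, lambda_cls=0.5, lambda_obj=1.0):
    """Weighted sum of the CIoU regression loss, Focal classification
    loss, and Focal object confidence loss."""
    return lambda_box * loss_box + lambda_cls * loss_cls + lambda_obj * loss_obj
```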