1. Introduction
In recent years, the development and utilization of the ocean have gradually become an important research direction. Since underwater vision research underpins marine-related disciplines, underwater image processing technology has developed rapidly [1,2]. Image segmentation is a basic method of target extraction that aims to partition an image into several constituent regions, each with coherent intensities, colors, and textures [3]. Furthermore, image segmentation can provide technical support for target tracking, image restoration [4,5,6], and other tasks.
Some results have already been achieved in underwater image segmentation. Liu et al. [7] proposed an improved level set algorithm based on the gradient descent method and applied it to segment underwater biological images. Wei et al. [8] improved the K-means algorithm to segment underwater image backgrounds and addressed the issue of improper K value determination; this algorithm can also minimize the impact of the initial centroid position of a grayscale image. SM et al. [9] used the Canny edge detection algorithm to segment underwater images, although the algorithm was greatly affected by background noise. Sun et al. [10] and Li et al. [11] used fuzzy C-means to segment underwater images. Rajeev et al. [12] used the K-means algorithm to segment underwater images. However, the clustering algorithms mentioned above are greatly affected by the local grayscale inhomogeneity of underwater images. Moreover, clustering algorithms contain local convergence errors and are only suitable for underwater images with a single background gray level.
Some investigators have segmented underwater images based on their optical properties and achieved good results. For example, Chen et al. [13] proposed an optical feature extraction, calculation, and decision method to identify the collimated region of artificial light and employed a level set method to segment the objects within the collimated region. This method could better identify the target region, but the level set method could not filter out background noise when the target region contained background information. Xuan et al. [14] proposed an RGB (red, green, blue) color channel fusion segmentation method for underwater images. The proposed method obtained a grayscale image with high foreground-background contrast and employed thresholding to conduct fast segmentation. However, the disadvantage of this method is that when the color of the background region is similar to that of the foreground region, the target cannot be segmented.
Active contour models have also been used for underwater image segmentation. Zhu et al. [15] used a cluster-based algorithm for co-saliency detection to highlight salient regions in underwater images; the local statistical active contour model was then used to extract the target contours. Qiao et al. [16] proposed an improved method based on the active contour model. The method used the RGB color space and contrast limited adaptive histogram equalization (CLAHE) to increase the contrast between the sea cucumber thorns and body, and then extracted the edges of the sea cucumber thorns using the active contour model. Li et al. [17] improved the traditional level set methods by avoiding the calculation of the signed distance function (SDF) to segment underwater images; the improved method reduced the computational cost by removing the need for re-initialization. Bai et al. [18] proposed a method based on morphological component analysis (MCA) and adaptive level set evolution to segment underwater images. The MCA sparsely decomposes the image into texture and cartoon parts. The new adaptive level set evolution method combines a piecewise threshold function with a variable weight coefficient and a halting speed function, and was used to obtain the edges of the cartoon part. Shelei et al. [19] segmented underwater grayscale images by fusing the geodesic active contour (GAC) model and the Chan-Vese (CV) model; however, this method required the target region of the underwater image to have a uniform grayscale. Chen et al. [20] integrated the transmission map and the saliency map into a unified level set formulation to extract the salient target contours of underwater images. Antonelli et al. [21] proposed spatially varying regularization methods that use local image features (such as texture, edges, and noise) to prevent excessive regularization in smooth regions and preserve spatial features in nonsmooth regions. However, it is difficult for this method to obtain accurate regularization parameters for underwater images with low contrast and blurred texture information, since local features such as textures and edges are blurred. Nawal et al. [22] presented an efficient approach for unsupervised segmentation based on image feature extraction to segment natural and textural images. Unfortunately, this method is unsuitable for natural underwater images with weak texture information.
As a new technology of image processing, neural networks have also been used for underwater image segmentation. O’Byrne et al. [23] proposed the use of photorealistic synthetic imagery to train deep encoder-decoder networks. This method synthesized virtual underwater images, each rendered with a corresponding ground-truth per-pixel label map. The mapping between the underwater images and the segmented images was established by training the encoder-decoder network. Zhou et al. [24] proposed a deep neural network architecture for underwater scene segmentation. The architecture extracted features with a pre-trained VGG-16 and learned to expand the lower-resolution feature maps using the decoder. Neural networks have achieved certain results in underwater image segmentation, but the lack of underwater datasets with corresponding per-pixel annotations is still a problem.
Most existing underwater image segmentation methods are designed for images with high foreground-background contrast and a single background grayscale. For underwater images with varying background grayscale and targets with complex textures, the segmentation results of the above methods are unsatisfactory. To address this problem, we propose a novel dual-fusion model with semantic information for salient object segmentation of underwater images with complex backgrounds. In summary, the contributions of our model are as follows:
We introduce saliency maps as semantic information to separate foreground information from background information;
We propose a dual-fusion energy equation that extracts the contours of saliency targets by integrating the local and global intensity fitting terms;
To compensate for missing saliency target information, we propose a correction module that corrects saliency target contour errors by introducing the original image contour information.
This paper is organized as follows: Section 1 reviews related work. In Section 3, we introduce in detail the derivation process of the dual-fusion model. Section 4 presents the experiments, in which we compare the proposed method with state-of-the-art segmentation methods, and the results demonstrate the superiority of the proposed method. The conclusion of this paper is given in Section 5.
3. Dual-Fusion Active Contour Model
In this section, we propose a dual-fusion active contour model with semantic information to extract the target contours of underwater images. Without semantic information, the existing methods cannot separate the target contour from the background. It is therefore necessary to introduce semantic information to roughly extract the saliency target contour from the complex background. To avoid extraction errors of the saliency target, we introduce the original image contour to correct and supplement the missing contour information. Using the semantic information and the correction module, the proposed model can accurately extract the saliency target contour from the complex background.
3.1. Saliency Image Fitting Energy
This paper uses the pyramid feature attention network [28] to acquire the saliency images. However, due to the low contrast of underwater images, there were some errors in the saliency detection results, such as local inhomogeneous intensity, background noise, and missing contour information. In view of the local inhomogeneous intensity of the saliency images, we preliminarily employ the local binary fitting to construct the energy functional $E^{SIF}$:

$$E^{SIF}(C, f_1, f_2) = \lambda_1 \int_\Omega \left[ \int_{in(C)} K_\sigma(x - y)\,|I_s(y) - f_1(x)|^2\,dy \right] dx + \lambda_2 \int_\Omega \left[ \int_{out(C)} K_\sigma(x - y)\,|I_s(y) - f_2(x)|^2\,dy \right] dx, \qquad (1)$$

where $I_s$ is the saliency image, $C$ is a contour in the image domain $\Omega$, and $f_1(x)$ and $f_2(x)$ are the image local fitting intensities near the point $x$. The local fitting intensities $f_1(x)$ and $f_2(x)$ can be expressed as follows [29,30]:

$$f_1(x) = \frac{K_\sigma(x) * \left[ H_\varepsilon(\phi(x))\,I_s(x) \right]}{K_\sigma(x) * H_\varepsilon(\phi(x))}, \qquad f_2(x) = \frac{K_\sigma(x) * \left[ (1 - H_\varepsilon(\phi(x)))\,I_s(x) \right]}{K_\sigma(x) * (1 - H_\varepsilon(\phi(x)))}, \qquad (2)$$

where $K_\sigma$ is the Gaussian kernel, $I_s$ is the saliency image, and $H_\varepsilon$ is the regularized Heaviside function, which can be expressed as:

$$H_\varepsilon(z) = \frac{1}{2}\left[ 1 + \frac{2}{\pi}\arctan\left( \frac{z}{\varepsilon} \right) \right]. \qquad (3)$$
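For illustration, the regularized Heaviside function of Equation (3) and the local fitting intensities of Equation (2) can be computed as in the following minimal Python sketch (an illustrative rendering, not the MATLAB implementation used in our experiments; the Gaussian convolutions are approximated with scipy.ndimage.gaussian_filter):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def heaviside(phi, eps=1.0):
    """Regularized Heaviside function H_eps of Equation (3)."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def local_fitting_intensities(I_s, phi, sigma=3.0, eps=1.0):
    """Local fitting intensities f1, f2 of Equation (2).

    I_s is the saliency image (float array) and phi the level set
    function; f1 and f2 are Gaussian-windowed averages of I_s over the
    phi > 0 and phi < 0 regions, respectively. A small constant avoids
    division by zero where a window lies almost entirely on one side.
    """
    H = heaviside(phi, eps)
    tiny = 1e-10
    f1 = gaussian_filter(H * I_s, sigma) / (gaussian_filter(H, sigma) + tiny)
    f2 = gaussian_filter((1.0 - H) * I_s, sigma) / (gaussian_filter(1.0 - H, sigma) + tiny)
    return f1, f2
```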
However, the local binary fitting may introduce some local minima and is sensitive to noise. Affected by the accuracy of saliency detection, the saliency map of underwater images will inevitably contain background noise. Moreover, the initialization curve greatly affects the segmentation results. To solve these problems, we introduce the global fitting term from the CV model into the energy functional $E^{SIF}$. The local-global fitting intensities can be expressed as follows:

$$f_1^{LG}(x) = (1 - \omega)\,f_1(x) + \omega\,c_1, \qquad f_2^{LG}(x) = (1 - \omega)\,f_2(x) + \omega\,c_2, \qquad (4)$$

where $f_1^{LG}$ and $f_2^{LG}$ are the mixed intensities, $c_1$ and $c_2$ are the two global average intensities of the CV model inside and outside the contour, computed with the Heaviside function of Equation (3), and $\omega$ is a weight coefficient ($0 \le \omega \le 1$). According to the test images in this paper, the value of $\omega$ can be taken from 0.5 to 0.9. Moreover, the more inhomogeneous the image intensity, the smaller the value of $\omega$.
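Continuing the sketch above (again only illustrative; omega weights the global term as in Equation (4)), the CV constants and the mixed intensities can be obtained as follows:

```python
def local_global_intensities(I_s, phi, omega=0.7, sigma=3.0, eps=1.0):
    """Mixed local-global fitting intensities of Equation (4).

    c1 and c2 are the CV global averages over the phi > 0 and phi < 0
    regions; a smaller omega gives more weight to the local term,
    which suits more inhomogeneous images.
    """
    H = heaviside(phi, eps)
    f1, f2 = local_fitting_intensities(I_s, phi, sigma, eps)
    c1 = (I_s * H).sum() / (H.sum() + 1e-10)
    c2 = (I_s * (1.0 - H)).sum() / ((1.0 - H).sum() + 1e-10)
    f1_lg = (1.0 - omega) * f1 + omega * c1
    f2_lg = (1.0 - omega) * f2 + omega * c2
    return f1_lg, f2_lg
```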
With the level set representation, the energy functional can be expressed as follows:

$$E^{SIF}(\phi, f_1^{LG}, f_2^{LG}) = \lambda_1 \int_\Omega \left[ \int_\Omega K_\sigma(x - y)\,|I_s(y) - f_1^{LG}(x)|^2\,H_\varepsilon(\phi(y))\,dy \right] dx + \lambda_2 \int_\Omega \left[ \int_\Omega K_\sigma(x - y)\,|I_s(y) - f_2^{LG}(x)|^2\,(1 - H_\varepsilon(\phi(y)))\,dy \right] dx. \qquad (5)$$

The improved fitting energy not only takes the local intensity information into account but also avoids local minimization. Therefore, for the saliency images of underwater images, the improved energy functional can extract the contours of inhomogeneous images more accurately.
3.2. Original Image Fitting Energy
The local inhomogeneous intensity and noise problems can be solved by fusing the local intensity fitting and the CV model. However, the missing contour information of the saliency image still needs to be addressed. Therefore, the original underwater images are used to compensate for the missing contour information.
In this paper, we use the local image fitting (LIF) model [27] to extract the contours of the original underwater images. The energy functional $E^{OIF}$ can be expressed as:

$$E^{OIF}(\phi) = \frac{1}{2} \int_\Omega |I_o(x) - I^{LFI}(x)|^2\,dx, \qquad (6)$$

where $I_o$ is the original image and $I^{LFI}$ is the local fitted image, as shown in Equation (7):

$$I^{LFI}(x) = m_1 H_\varepsilon(\phi(x)) + m_2 (1 - H_\varepsilon(\phi(x))), \qquad (7)$$

where $m_1$ and $m_2$ are the average intensities of the original image in a Gaussian window on the two sides of the contour. Although models such as LBF [29,30], LGIF [31], and RMPCM [3] can also extract the target contours of underwater images very well, as shown in Figure 1, the LIF model has higher efficiency. This higher efficiency is because the energy functional of the LIF model does not include a kernel function. Moreover, the LIF model can fit the original image well while significantly reducing noise by minimizing the difference between the fitted image and the original image.
As shown in Figure 1, the LBF, LGIF, and LIF models could extract the target contour well, but LBF was more sensitive to the initial contour curve (as shown in the green dashed area). The energy functionals of LGIF and RMPCM both involve a kernel function, which performs more than one convolution operation per iteration step, so their evolution speed is slow. The running times of the above models are shown in Table 1. Figure 1 and Table 1 intuitively show that the LIF model has advantages regarding both speed and contour extraction results. Therefore, we use the LIF model to extract the original image contour to correct the contour information of the salient target.
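For illustration, the local fitted image of Equation (7) can be computed as follows (a minimal Python sketch reusing the heaviside helper and gaussian_filter from the sketch in Section 3.1):

```python
def lif_fitted_image(I_o, phi, sigma=3.0, eps=1.0):
    """Local fitted image I_LFI of Equation (7).

    m1 and m2 are the average intensities of the original image I_o in
    a Gaussian window over the phi > 0 and phi < 0 regions; the fitted
    image blends them through the regularized Heaviside function.
    """
    H = heaviside(phi, eps)
    tiny = 1e-10
    m1 = gaussian_filter(H * I_o, sigma) / (gaussian_filter(H, sigma) + tiny)
    m2 = gaussian_filter((1.0 - H) * I_o, sigma) / (gaussian_filter(1.0 - H, sigma) + tiny)
    I_lfi = m1 * H + m2 * (1.0 - H)
    return I_lfi, m1, m2
```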
3.3. Dual-Fusion Active Contour Model
To use less fitting energy at the target contours than at other locations, we use an edge indicator function [32,33] to indicate the target contours. The function can be expressed as follows:

$$g = \frac{1}{1 + |\nabla (G_\sigma * I)|^2}, \qquad (8)$$

where $G_\sigma$ is a Gaussian kernel with standard deviation $\sigma$ and $I$ is the image.
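For illustration, the edge indicator of Equation (8) can be computed as follows (a minimal Python sketch, assuming numpy and scipy.ndimage.gaussian_filter as above):

```python
def edge_indicator(I, sigma=1.5):
    """Edge indicator g = 1 / (1 + |grad(G_sigma * I)|^2), Equation (8).

    g is close to 0 near strong edges and close to 1 in smooth regions,
    so the fitting energy is down-weighted at contours.
    """
    smoothed = gaussian_filter(I, sigma)
    gy, gx = np.gradient(smoothed)
    return 1.0 / (1.0 + gx ** 2 + gy ** 2)
```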
Then, we define the dual-fusion intensity fitting energy functional as follows:

$$E^{DF}(\phi) = \gamma\,E^{OIF}(\phi) + (1 - \gamma)\,E^{SIF}(\phi), \qquad (9)$$

where $\gamma$ is a weight coefficient ($0 \le \gamma \le 1$), and $E^{OIF}$ and $E^{SIF}$ are the original images' fitting energy functional and the saliency images' fitting energy functional, respectively.
Finally, the dual-fusion intensity fitting energy functional $E$ can be obtained by combining Equations (5), (6), (8), and (9):

$$E(\phi) = \int_\Omega g(x)\left[ \frac{\gamma}{2}\,|I_o(x) - I^{LFI}(x)|^2 + (1 - \gamma)\sum_{i=1}^{2}\lambda_i \int_\Omega K_\sigma(x - y)\,|I_s(y) - f_i^{LG}(x)|^2\,M_i^\varepsilon(\phi(y))\,dy \right] dx, \qquad (10)$$

where $M_1^\varepsilon(\phi) = H_\varepsilon(\phi)$ and $M_2^\varepsilon(\phi) = 1 - H_\varepsilon(\phi)$.
Then, we minimize $E$ with respect to $\phi$ to obtain the corresponding gradient descent flow [29,30,31]:

$$\frac{\partial \phi}{\partial t} = g\,\delta_\varepsilon(\phi)\left[ \gamma\,(I_o - I^{LFI})(m_1 - m_2) + (1 - \gamma)\,(-\lambda_1 e_1 + \lambda_2 e_2) \right], \qquad (11)$$

where $\delta_\varepsilon$ is the derivative of the regularized Heaviside function of Equation (3), and:

$$e_i(x) = \int_\Omega K_\sigma(y - x)\,|I_s(x) - f_i^{LG}(y)|^2\,dy, \quad i = 1, 2, \qquad (12)$$

where $I_o$ and $I_s$ are the original image and the saliency image, respectively, $f_i^{LG}$ represents the integrated local and global intensities, and $m_1$ and $m_2$ are the averages of the image intensities in a Gaussian window inside and outside the contour.
3.4. Regularization of the Level Set Function
As pointed out in Zhang's method [27], Gaussian filtering can replace the traditional regularization term to regularize the level set function. Therefore, the smoothing process of the level set function can be expressed as:

$$\phi^{n+1} = G_\xi * \left( \phi^n + \Delta t\,\frac{\partial \phi^n}{\partial t} \right), \qquad (13)$$

where $\xi$ is the standard deviation of the Gaussian kernel $G_\xi$ and $\Delta t$ is the time-step.

In fact, the smoothing effect of Gaussian filtering on the level set function is slightly worse than that of the traditional regularization term and is greatly affected by the time-step. However, the computational efficiency of Gaussian filtering is much higher than that of the traditional regularization term.
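In the illustrative sketch above, this regularization amounts to one extra Gaussian filtering of the level set function per iteration:

```python
def regularize(phi, xi=1.0):
    """Smooth the level set function with a Gaussian kernel instead of
    a traditional regularization term (Zhang's method [27]); xi is an
    illustrative value for the standard deviation."""
    return gaussian_filter(phi, xi)

# One full iteration: gradient descent step, then Gaussian smoothing.
# phi = regularize(evolve_step(phi, I_o, I_s, g), xi=1.0)
```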
4. Results and Discussion
In this section, we tested the proposed method on intensity-inhomogeneous underwater images captured from underwater videos downloaded from the NATURE FOOTAGE website and the Fish Dataset. Moreover, the method was compared with some state-of-the-art contour extraction methods in terms of efficiency and accuracy. All contour extraction results were produced on the same computer to ensure fairness. The computer was configured with an Intel(R) Core(TM) i7-8650U CPU @ 2.11 GHz, 16.00 GB memory, the Windows 10 system, and an x64 processor. MATLAB R2017a was the software platform. We used the same parameters and time-step $\Delta t$ in all experiments. The initial level set function was defined by:

$$\phi(x, t = 0) = \begin{cases} -\rho, & x \in \Omega_0 \\ \rho, & x \in \Omega \setminus \Omega_0 \end{cases} \qquad (14)$$

where $\rho$ is a positive constant fixed in our experiments, and $\Omega_0$ and $\Omega \setminus \Omega_0$ represent the regions inside and outside of the initial contour, respectively.
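For illustration, the binary initial level set of Equation (14) can be built from a mask of the initial region $\Omega_0$ (a minimal Python sketch; rho = 2.0 is a placeholder value, not the constant used in our experiments):

```python
def init_phi(mask, rho=2.0):
    """Binary initial level set of Equation (14): -rho inside the
    initial region Omega_0 (mask == True) and +rho outside, following
    the sign convention of Equation (14)."""
    return np.where(mask, -rho, rho).astype(np.float64)
```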
Moreover, the parameter $\gamma$ is a constant that controls the relative influence of the saliency image fitting energy and the original image fitting energy. When the missing information of the saliency target contour is severe, $\gamma$ should be relatively large; otherwise, $\gamma$ should be small. Moreover, $\omega$ should be smaller when the intensity inhomogeneity of the saliency image is severe. This is because the local intensity fitting can better segment the target in intensity-inhomogeneous regions, and the contour extraction results then rely on the local intensity fitting. Otherwise, $\omega$ should be larger to suppress noise interference. In the experiments, we chose appropriate values for $\omega$ and $\gamma$ according to the degree of inhomogeneity and the degree of saliency detection deviation. In the experiments of this paper, the value of $\omega$ was from 0.5 to 0.9, and the value of $\gamma$ was from 0.1 to 0.8.
4.1. The Benefits of Local-Global Intensity Fitting
A comparative experiment was performed to prove the effectiveness of the local-global intensity fusion in Section 3.1. We conducted different experiments, as shown in Table 2.

In experiment A, the fitting intensity of the energy functional is the local intensity. In experiment B, the fitting intensity of the energy functional is the global intensity. The energy functional with the fused local-global intensity is used in experiment C. The contour extraction results of the experiments are shown in Figure 2.

As shown in Figure 2, experiment A extracted the target contour in the intensity-inhomogeneous region, but the result was greatly affected by the initial contour curve (blue circled area) and was sensitive to noise (green circled area). Moreover, the method of experiment A also extracted contours in non-boundary regions. Experiment B could extract the target contour in the intensity-homogeneous region and was not disturbed by noise, but it could not extract the target contour in the intensity-inhomogeneous region. Comparing the segmentation results in Figure 2a–c shows that the fused energy functional (experiment C) can not only effectively eliminate the influence of the initial curve and noise interference but also effectively segment intensity-inhomogeneous regions.
4.2. The Effect of Original Image Correction
Figure 3 shows the result of our method for underwater image segmentation. As shown in Figure 3b, the marked coordinate points are located at the saliency target edge. However, in Figure 3c, the coordinate point at the same position lies inside the target instead of on the target edge. This error is caused by the deviation of the saliency detection. Therefore, it is necessary to use the original image to supplement the missing information. This paper used the local image fitting method to extract the contour information of the original image and then used this contour information to correct the deviation caused by the saliency detection. The result of the correction is shown in Figure 3e: the missing contour information of the saliency image is accurately supplemented, and the background information is filtered out.
4.3. Performance of the Dual-Fusion Active Contour Model
Figure 4 shows the performance of our method. It can be seen from Figure 4d that our method can filter out the background information and accurately extract the target contour. Figure 4b shows the saliency images of the original underwater images. The red circles mark the intensity-inhomogeneous regions, the yellow circles mark the noise regions, and the green circles mark the missing regions of the target. For the intensity-inhomogeneous and noise regions, our method can still extract the target contour well using the local-global intensity fitting term. Moreover, the saliency image of the first image obviously lacks part of the target information (green circle region). Our method can still extract the complete target contour by integrating the original image contour information.
4.4. Qualitative Comparison
4.4.1. Comparison of the Segmentation Results with Other Models
To verify the effectiveness of the proposed method, we compared its segmentation results with those of other classic models, namely LBF [29,30], LGIF [31], LIF [27], and RMPCM [3]. The comparison results are shown in Figure 5.
It can be seen from Figure 5 that the LBF model is limited by the initial contour curve and cannot completely extract the target contour. The LGIF model is minimally affected by local background noise due to the fusion of the global intensity fitting, but it still cannot accurately extract the target contour. The LIF and RMPCM models can completely extract the target contour, but they are greatly affected by background noise and local target features. Our model introduces semantic information to filter out background noise very well. Moreover, because of the global-local intensity fitting, our method can handle local inhomogeneous regions without being misled by local target features. In addition, the target contour of the original image perfectly complements the missing semantic information.
Furthermore, we compared the segmentation results with Zhu's method [15] and Chen's method [20], which also introduce saliency images as semantic information. Since we could not obtain the source codes of Zhu's method [15] and Chen's method [20], to ensure the fairness of the comparison, we adapted the segmentation images from Zhu's method [15] (2017, IEEE) and Chen's method [20] (2019, IEEE) as the comparison images. The comparison results are shown in Figure 6.
As can be seen in Figure 6, even though our method, Zhu's method [15], and Chen's method [20] all introduce semantic information, our method can extract the target contour more accurately than the other two. As shown in the blue circle region of Figure 6a, our method extracted the target contour in the detail region more accurately. This is because we added the local-global fitting term to better extract the contours of local inhomogeneous regions, and the original image correction module can correct the errors in the semantic information. As shown in the green circle region of Figure 6b, our method can filter out background noise better than Chen's method [20] and is more robust.
4.4.2. Comparison of the Saliency Segmentation Results with Other Models
To further verify the superiority of the proposed method, we also compared the contour extraction results obtained when the saliency image is used as the input of several classic models. To test the robustness of the proposed method, we only selected low-quality saliency images (with inhomogeneous local intensity and incomplete saliency information) for the comparison experiments. As shown in Figure 7, the segmentation results of LBF are severely affected by the initial contour curve and are disturbed by the inhomogeneous regions inside the target. The LGIF model can avoid the influence of the initial contour curve but cannot extract complete contour information, as shown in the green dotted region in Figure 7(2). The LIF model can extract the target contour relatively completely, but it easily falls into a local optimum and is also affected by the initial contour curve. The RMPCM model avoids the local optimum error, but it also cannot extract the contour information completely, as shown in the green dotted region in Figure 7(2). Our method can effectively avoid the local optimum and supplements the missing contour information through the original image. Hence, the results of our method are more accurate and complete than those of the other methods.
4.5. Quantitative Comparison
In the following experiment, we compare the proposed method with the aforementioned methods using several evaluation indexes to conduct a quantitative analysis. Here, three evaluation indicators, namely the mean absolute error (MAE), the error rate (ER), and the detection rate (DR), are employed for the quantitative comparison. The MAE, ER, and DR can be expressed by the following equations:

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)|, \qquad (15)$$

$$ER = \frac{|S \cup G| - |S \cap G|}{|S \cup G|}, \qquad (16)$$

$$DR = \frac{|S \cap G|}{|G|}, \qquad (17)$$

where $W$ and $H$ represent the length and width of the image, $S$ is the result of the image segmentation, and $G$ is the hand-crafted ground truth. $S \cap G$ represents the contour that is accurately extracted by the model, and $S \cup G$ represents the union of the image segmentation result and the ground truth. The larger $|S \cap G|$, the more contour is correctly extracted. $|S \cup G| - |S \cap G|$ counts the pixels that are incorrectly extracted, so the larger this value is, the more pixels are incorrectly extracted, and the smaller the value of ER, the more accurate the result of the contour extraction. Likewise, the smaller the value of MAE, the more accurate the contour extraction result, and a large value of DR indicates that the contour extraction result of the model is accurate. The evaluation results of the aforementioned five methods are shown in Table 3, Table 4 and Table 5.
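For illustration, the three indicators can be computed as follows (a minimal Python sketch with S and G as binary masks; the ER and DR forms follow Equations (16) and (17)):

```python
def segmentation_metrics(S, G):
    """MAE, ER, and DR of Equations (15)-(17) for binary masks
    S (segmentation result) and G (ground truth)."""
    S = S.astype(bool)
    G = G.astype(bool)
    mae = np.abs(S.astype(float) - G.astype(float)).mean()
    inter = np.logical_and(S, G).sum()
    union = np.logical_or(S, G).sum()
    er = (union - inter) / (union + 1e-10)   # incorrectly extracted share
    dr = inter / (G.sum() + 1e-10)           # correctly extracted share
    return mae, er, dr
```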
A smaller value of MAE represents a higher contour extraction accuracy. According to Table 3, the contours extracted by the proposed model obtained the smallest MAE values, which shows that the proposed model can extract target contours more accurately than the other four models. Table 4 shows the error rates (ER) of the five methods. The ER values between the target contours extracted by the proposed method and the ground truth are the smallest, so the proposed method has the highest accuracy. Table 5 shows the detection rates of the five methods. The detection rate represents how many contour pixels are correctly extracted. Therefore, our model, with the highest detection rate, can extract the target contour most accurately.