Extensive experiments were conducted on imagery from diverse sources to evaluate the efficacy of the proposed method for acquiring 3D information about underwater targets in the field of marine engineering. Section 4.1 introduces the algorithm's operating environment and the primary parameter configurations used in this study. Section 4.2 presents experiments on the UW-Middlebury dataset, a customized variant of the Middlebury benchmark dataset tailored to underwater environments. Section 4.3 reports experiments on real underwater binocular image pairs spanning different scenes.
4.1. Settings
In our experimental setup, we employed a personal computer equipped with a Xeon E5-1620 CPU (Intel, Santa Clara, CA, USA; clocked at 3.50 GHz, four physical cores) and 36 GB of memory. Parallel acceleration was achieved with eight threads. Our method was implemented in C++ in conjunction with OpenCV.
The configuration of our energy function is as follows. In Equation (5), we adopt a parameter to regulate the range of cost values, following [3,36]. To balance the impact of ZNCC and CT on the data term, the corresponding weight is set to 0.5. The image window for ZNCC and CT covers exactly 63 pixels, enabling the generation of a 63-bit binary code via the census transform. Since this encoding is nearly 64 bits, the Hamming distance can be obtained with convenient bitwise operations. The kernel area of the filter in Equation (4) is set to 41 × 41, following [43].
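As an illustrative sketch of this part of the data term (the 9 × 7 window shape and the bit layout are assumptions consistent with the 63-pixel count above, not necessarily the exact implementation), the census code and its Hamming-distance cost can be written as:

```cpp
#include <cstdint>
#include <bitset>

// Sketch of a census transform over an assumed 9x7 window: 63 pixels, one
// comparison bit per pixel (the center compares against itself and always
// contributes 0), packed into a 64-bit word.
uint64_t censusCode(const unsigned char* img, int width, int x, int y) {
    const unsigned char center = img[y * width + x];
    uint64_t code = 0;
    for (int dy = -3; dy <= 3; ++dy) {      // 7 rows
        for (int dx = -4; dx <= 4; ++dx) {  // 9 columns -> 63 bits
            code <<= 1;
            if (img[(y + dy) * width + (x + dx)] < center) code |= 1u;
        }
    }
    return code;
}

// Hamming distance between two census codes via a single XOR and a popcount,
// which is why a code of at most 64 bits keeps the comparison a cheap
// bitwise operation.
int hammingDistance(uint64_t a, uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```

In practice the Hamming cost would be blended with the ZNCC score using the 0.5 weight described above.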
The parameters governing the smoothness term are as follows. In Equation (2), the balancing weight is set to equalize the data term and the smoothness term. The parameter values in Equations (11) and (12) follow LocalExp [24], with one fixed as in that work and the other set to 25. In addition, a color threshold is used in Equation (14), following [37]. Our method employs three distance thresholds in Equation (15); each is directly proportional to the image's width, so that grids of various sizes can adapt to different textural regions. The number of iterations for plane refinement follows [24].
4.2. Experiments on UW-Middlebury Dataset
The findings in [44] establish a correlation between underwater image degradation and the accuracy of disparity estimation. We introduce a deep learning rendering method [45] that converts the Middlebury dataset [46] into the underwater UW-Middlebury dataset, as shown in Figure 6. The generated dataset utilizes natural light fields to emulate the features of underwater scenes, transforming ordinary images into underwater-style renderings and effectively replicating the visual characteristics unique to submerged environments.
The images in the UW-Middlebury dataset exhibit significant color deviations from the originals. Alongside the color transfer, a depth-based turbidity simulator is utilized to generate the degradation characteristic of actual underwater imagery. We apply the algorithm in [47], a probabilistic network that requires no ground-truth images, to enhance the images within the UW-Middlebury dataset; it adjusts the contrast of the simulated images, thereby mitigating the color distortion. Figure 6 depicts the ten sets of experimentally relevant images. The "Piano" and "Motorcycle" pairs exhibit imbalanced illumination, simulating the challenges posed by artificial underwater lighting. Each set comprises the left image of a matched pair from the Middlebury dataset, the ground-truth disparity map, its counterpart in the UW-Middlebury dataset, and the result of enhancement processing applied to the UW-Middlebury image.
4.2.1. Ablation Study
There are two optimization stages: the coarse matching stage and the fine matching stage. By constraining the number of iterations in each stage, we can compare their respective impacts. We designed multiple sets of quantitative experiments employing three image groups (Reindeer, Teddy, and Cones) from the UW-Middlebury dataset, denoted by the abbreviations r, t, and c in Figure 7.
Our approach can minimize the data term while forgoing the smoothness term, avoiding the intricacies of energy-function computation. The cost function in Equation (3) can also derive a label mapping l for a propagation grid via Equation (22). Equations (21) and (22) share the same variables. As shown in Figure 7, label updates in the first two iterations are steered by the cost function in Equation (22), followed by local expansion moves based on Equation (21) in the subsequent iterations.
Due to the absence of plane refinement, the propagation process in the fine stage fails to generate sufficient candidate labels. In contrast to the continuous and infinite label space, algorithms in the fine stage can engage with only a very limited number of candidate labels, so this process is better viewed as a redistribution of preexisting pixel labels. Assigning appropriate candidate labels to pixels becomes highly challenging, which prevents this stage from being executed independently; consequently, increasing the iteration count in this stage has minimal influence on the optimized results. Substituting the cost function with an energy function for local expansion moves enhances algorithmic accuracy, though altering the optimization stages yields less pronounced improvements in accuracy while notably extending runtime, as shown in Figure 8.
Following the methods that employ a coarse-to-fine matching strategy, such as [
26], which comprise multiple coarse matching stages followed by one fine matching stage, our method undergoes a total of six iterations. The choice of iteration rounds represents a trade-off between accuracy and computational efficiency.
In
Figure 7, the legend "Prop" represents one coarse stage and one fine stage in the first two iterations, and three coarse stages and one fine stage in the subsequent iterations. "Prop_r" signifies the use of this hybrid process for disparity estimation on Reindeer. The legend "Exp" encompasses only coarse stages, regardless of whether the label-updating criterion accounts for the smoothness term. In the ablation experiments assessing the roles of the two matching stages, the coarse stage can be evaluated separately through multiple executions (the "Exp" curves). Drawing on the abundant candidate labels yielded by the plane-refinement step in that stage, the fine stage can iteratively refine these labels, shedding light on its function (the "Prop" curves).
The results of our experiments on the three image groups indicate that our method readily converges when only the coarse stage is applied. Applying the fine stage further optimizes the results of the coarse stage, improving accuracy under both label-updating criteria.
In
Figure 9, we compare the disparity maps produced by the different matching stages on Cones. Within the framework of local expansion moves utilizing the data term and the energy function, the disparity maps from the coarse stage alone and from the combined coarse and fine stages are referred to as "Expan_d", "Expan_e", "Prop_d", and "Prop_e", respectively. The final disparity map obtained after post-processing "Prop_e" is referred to as "Post_processing", whose error rates are documented in
Table 1. In the right panel of
Figure 7, the second and sixth iterations of the “Exp_c” line correspond to the maps “Expan_d” and “Expan_e”. Similarly, the “Prop_c” line corresponds to the maps “Prop_d” and “Prop_e”.
Observations within Frame 2 reveal mismatches when using the coarse stage alone. While introducing an energy function alleviates this issue, the algorithm still exhibits misalignment in the repetitive-texture region. In the regions outlined in Frames 1 and 3, the algorithm produces mismatches near the target boundaries when only the coarse stage is used. However, introducing a supplementary fine stage significantly improves performance in these regions, as shown in the disparity map "Prop_e".
Parallelization of the two optimization stages: when updating labels using Equation (22), comparing CPU×8 with CPU×1 shows a speed-up of about 3.5×; when updating labels using Equation (21), parallel computation increases operational speed by roughly a factor of 4. Our algorithm does not outperform others in terms of running speed. Its coarse matching stage consumes roughly twice the runtime of LocalExp [24], an algorithm with a framework similar to our coarse stage, primarily because LocalExp uses simple cost functions based on pixel color features and does not divide adaptive regions. Despite using an energy function rather than a cost function to update candidate labels in both matching stages, the modest increase in runtime suggests that the method's computational complexity is dominated by the cost function. The extended runtime of the fine matching stage likely stems from the propagation process requiring more iterations within a center grid.
4.2.2. Results on the UW-Middlebury Dataset
We compared our method with LocalExp [
24], Zhuang’s [
21], and Lv’s [
3] methods. The Cones, Reindeer, and Teddy in the UW-Middlebury dataset were subjected to the cost function described in Equation (
3), while the remaining images were evaluated using a deep learning-based cost volume [
We substituted the raw cost function of Equation (5) with the following function: where the truncation coefficient limits the range of cost values. Given a discrete disparity d, the function aggregates the matching costs of all pixels within a square window centered on the point s, following [24].
In contrast, PaLPaBEL [35], PatchMatch [19], and SGBM [41] can only apply their individual cost functions to the whole dataset, owing to constraints of their algorithmic frameworks.
The error rates of the different methods for disparity estimation on the UW-Middlebury dataset are reported as follows. Table 2 displays the rankings for the bad 1.0 metric, which measures the percentage of erroneous pixels under an error threshold of 1.0 pixels, evaluated over non-occluded regions. Table 1 applies the same metric to the entire image area. Both tables show the best result for each image in bold.
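As a sketch of how the bad 1.0 statistic is computed (a simplified version; the exact occlusion and validity masking follows common Middlebury practice and is an assumption here):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Sketch of the "bad 1.0" metric: the percentage of evaluated pixels whose
// estimated disparity deviates from ground truth by more than 1.0 pixel.
// The mask restricts evaluation (e.g., to non-occluded pixels); masked-out
// entries and non-positive (invalid) ground-truth values are skipped.
double badPixelRate(const std::vector<float>& est,
                    const std::vector<float>& gt,
                    const std::vector<unsigned char>& mask,
                    float threshold = 1.0f) {
    int evaluated = 0, bad = 0;
    for (std::size_t i = 0; i < est.size(); ++i) {
        if (!mask[i] || gt[i] <= 0.0f) continue;  // skip occluded/invalid
        ++evaluated;
        if (std::fabs(est[i] - gt[i]) > threshold) ++bad;
    }
    return evaluated ? 100.0 * bad / evaluated : 0.0;
}
```

Evaluating with an all-ones mask yields the whole-image variant used in Table 1, while a non-occlusion mask yields the variant in Table 2.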
Our method exhibits a clear advantage over the other algorithms used for comparison. By applying distinct matching costs to different image groups, we conducted a comparative analysis of the disparity estimation results for Cones, Teddy, and Reindeer. The results demonstrate the advantages of our cost function in Equation (3), while the matching errors for Adirondack and the six remaining images illustrate the superiority of our label optimization process.
4.3. Experiments on Real Underwater Images
Two authentic datasets, sourced from deep-sea and nearshore environments, respectively, were utilized in this section's experiments. We employed a real underwater dataset acquired by the Hawaii Institute of Marine Biology [10], whose images are provided in two sizes. The dataset was accessed on 20 February 2024 at: https://github.com/kskin/UWStereo. Because the images in this dataset were pre-processed by prior researchers, we used the raw underwater image pairs directly as input. In Figure 10, the left column shows the underwater images; from top to bottom are the left images of Leaf and Seabed. The remaining columns depict the disparity maps generated by the different algorithms. We compared our method with all algorithms employed in
Section 4.2, except for PatchMatch and SGBM, given their poor performance in
Section 4.2.2.
We also utilized another dataset previously captured by our research institute and presented in [3], which proposes a color correction method for non-uniform lighting conditions. This dataset comprises five underwater scenes that exhibit the characteristic features of underwater images: color deviations resulting from underwater light absorption and scattering, and image quality degradation due to uneven lighting. The dataset provides calibration data for the images and ground-truth meshes of the target objects obtained with a Kinect.
The dataset from our institute features a resolution of 4 K, processed using an enhancement algorithm [
49]. Following this, stereo rectification was conducted based on the calibration data obtained from the “MATLAB Stereo Camera Calibrator” [
The dataset was accessed on 26 February 2024 at:
https://github.com/uwstereo/underwater-datasets. We conducted a comparative experiment on multiple stereo matching methods using calibrated images, as shown in the left column of
Figure 11. These underwater images are labeled Coralstone1, Coralstone2, Shell, Starfish, and Fish from top to bottom. The second column depicts the ground-truth 3D mesh of underwater targets. The remaining columns depict the disparity maps generated by different algorithms.
To elucidate the significance of each component, we subdivided the compared algorithms into their constituent elements (disparity representation, energy function, and optimization strategy), as detailed in Table 3, irrespective of the algorithms' other implementation details.
PaLPaBEL [35], a stereo matching algorithm based on a propagation optimization strategy, is not suited to parallel computing. It is the only method in Table 3 that employs discrete disparity values for disparity estimation. On irregular surfaces, its disparity maps exhibit block artifacts, with noisy regions akin to those shown for Coralstone2 in Figure 11, evident when the maps are magnified. The other methods do not exhibit this problem on irregular object surfaces, emphasizing the significance of utilizing 3D labels for disparity representation.
LocalExp [
24], which puts forward the local expansion moves, utilizes the same optimizer as our method. Furthermore, its algorithmic framework is akin to the approaches developed by Lv [
3] and Zhuang [
21]. Notably, it is the sole method in the table that employs an energy function based on color features, comprising pixel colors and color gradients. The results indicate that LocalExp yields disparity maps with significant noise and mismatched points on both underwater datasets, whereas the other stereo matching methods, which employ relative pixel-value information in their cost functions, perform better. These comparative experiments demonstrate the benefit of our color-intensity-independent energy function when handling degraded underwater images.
The coarse matching stage in our approach shares algorithmic structural similarities with the methods of Lv [
3] and Zhuang [
21]. The key difference lies in how these methods rely on random sampling within a central grid area to extract candidate labels, whereas our coarse matching stage utilizes segment information of cross-based patches across a grid to determine the range for extracting candidate labels. Experimental results on two sets of underwater images show that Lv’s method excels in deep-sea environments, as shown in
Figure 10, while Zhuang’s method performs better in nearshore environments, as shown in
Figure 11. The robustness of these methods across different underwater datasets is therefore limited. In contrast, by incorporating segment-level information derived from cross-based patches into the extraction of candidate labels for both the expansion and propagation processes, our approach performs well across both datasets. The visual evidence confirms that our method yields smoother disparity maps with reduced noise and fewer error regions (completely black areas in the disparity maps, where the disparity value is 0, equivalent to infinite depth).
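The remark that zero-disparity pixels correspond to infinite depth follows from the standard triangulation relation for a rectified stereo pair; as a sketch (the focal length and baseline values in the comments are illustrative, not the rig's calibration):

```cpp
#include <limits>

// Sketch: for a rectified binocular pair, depth Z = f * B / d, where f is
// the focal length in pixels, B the baseline in meters, and d the disparity
// in pixels. A disparity of 0 therefore maps to infinite depth, which is why
// all-black (zero-disparity) regions in the maps mark reconstruction failures.
double depthFromDisparity(double focalPx, double baselineM, double disparityPx) {
    if (disparityPx <= 0.0)
        return std::numeric_limits<double>::infinity();
    return focalPx * baselineM / disparityPx;
}
```

For example, with f = 1000 px and B = 0.125 m, a disparity of 50 px corresponds to a depth of 2.5 m.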
To assess the accuracy of the different methods, we triangulated the objects, segmented in the disparity maps [51], to obtain point clouds. These point clouds were compared with their corresponding ground-truth meshes through an initial manual coarse registration, followed by fine registration using the iterative closest point (ICP) algorithm [52]. The mean and standard deviation of the Hausdorff distance between the clouds and the meshes serve as evaluation metrics. The statistical findings for each method are summarized in
Table 4.
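The reported statistics can be sketched with a brute-force nearest-neighbor pass (a simplified point-to-point stand-in for the point-to-mesh Hausdorff computation, applied after registration):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <limits>

struct Pt { double x, y, z; };

// Sketch: for each reconstructed point, find the distance to its nearest
// reference point (brute force; the actual evaluation measures distance to
// the ground-truth mesh surface), then report the mean and standard
// deviation of those distances.
void distanceStats(const std::vector<Pt>& cloud, const std::vector<Pt>& ref,
                   double& mean, double& stddev) {
    std::vector<double> d;
    d.reserve(cloud.size());
    for (const Pt& p : cloud) {
        double best = std::numeric_limits<double>::max();
        for (const Pt& q : ref) {
            double dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
            best = std::min(best, dx * dx + dy * dy + dz * dz);
        }
        d.push_back(std::sqrt(best));
    }
    mean = 0.0;
    for (double v : d) mean += v;
    mean /= d.size();
    double var = 0.0;
    for (double v : d) var += (v - mean) * (v - mean);
    stddev = std::sqrt(var / d.size());
}
```

A production evaluation would use a spatial index (e.g., a k-d tree) and true point-to-triangle distances rather than this quadratic point-to-point scan.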
The superiority of our method in accurately reconstructing 3D underwater targets is highlighted in
Table 4. Compared to Lv's method, the most accurate method currently available, ours achieves a significant improvement in the average precision of target reconstruction, along with notable gains in the per-point deviation accuracy. These experimental results show that our method delivers favorable outcomes, generating dense disparity maps under demanding underwater conditions characterized by light absorption and scattering.