1. Introduction
Unmanned underwater vehicles (UUVs) play an irreplaceable role in many fields, serving as oceanic equipment suited to underwater tasks. They are widely used for tasks including seafood fishing, subsea pipeline tracking, seafloor mapping, submarine cable laying, and marine resource exploration. However, as the scope of UUV missions continues to expand, their endurance has become a focal point. Because UUVs carry limited energy, frequent charging is necessary, and surface charging not only reduces the operational efficiency of UUVs and increases costs but also compromises their stealth during mission execution [1,2]. To address this challenge, underwater charging platforms have emerged, enabling UUVs to recharge without surfacing. There is already a wealth of research on UUV docking: researchers have used acoustic [3,4], optical [5], and electromagnetic [6,7] navigation systems to guide UUVs into the docking station (DS). However, to the best of our knowledge, few methods are available for guiding the charging stake to insert accurately into the UUV’s charging port after docking. Achieving automatic alignment of underwater charging platforms is a current trend in the development of underwater equipment technology and holds significant research and practical value.
Although there is limited research on the automatic alignment of underwater charging platforms, the process of inserting the charging stake into the UUV’s charging port can be conceptualized as a peg-in-hole assembly. Solutions to this problem can be broadly categorized into contact-based and non-contact-based methods. Contact-based methods [8] typically bring the end of the shaft into contact with the plane containing the hole and then use a force sensor to search for the hole’s position on that plane. This approach offers low safety and can damage the outer surface of the UUV. Non-contact-based methods can be divided into those based on laser alignment instruments, acoustic sensors, and vision sensors. The core components of a laser alignment instrument are a semiconductor laser that emits laser beams and a photoelectric semiconductor position detector that collects information about the position of the laser spot [9]. Precise alignment can therefore be achieved by installing laser alignment instruments on the hole axis. However, small planktonic organisms underwater scatter the light, degrading alignment accuracy, and suspended particles in the water obstruct the laser, preventing the alignment process from continuing. Because sound waves attenuate slowly underwater, acoustic sensors are widely used in various underwater positioning and navigation tasks [4], but their alignment accuracy is lower in short-distance scenarios. Vision-based alignment, on the other hand, corrects alignment deviations through visual feedback, providing positioning and guidance for fragile or easily disturbed objects without physical contact [10]. It is robust in underwater environments and meets alignment requirements in terms of accuracy. Therefore, we adopt a vision-based non-contact alignment approach to guide the charging stake into the UUV’s charging port.
Alignment operations using vision sensors have been widely studied in various fields. For instance, Fan et al. [11] proposed a laser-vision-sensor-based method for initial point alignment of narrow weld seams by utilizing the relationship between laser streak feature points and initial points. They obtain a high signal-to-noise image of the narrow weld seam using the laser vision sensor, calculate the 3D coordinates of the final image feature point and the initial point based on the alignment model, and finally control the actuator to achieve the initial point alignment. In another study, Chen et al. [12] developed an automatic alignment method for tracking the antenna of an unmanned aerial vehicle (UAV) using computer vision technology. The antenna angle is adjusted using the relative position between the center of the UAV image and the center of the image from the camera fixed on the antenna; the two image centers overlap when the antenna is aligned. Similarly, Jongwon et al. [13] designed a vision system that uses three cameras to locate the wafer’s position for wafer alignment.
Underwater image processing techniques are of great importance for underwater charging platform alignment because underwater images suffer from low contrast, blurred edges, blue-green color casts, and other degradations. Traditional underwater image processing techniques can be divided into two categories: image enhancement and image restoration. Image enhancement algorithms [14] include histogram equalization, white balance, Retinex, wavelet transform, etc. These algorithms render underwater objects clearer by enhancing image contrast and denoising. Image restoration techniques recover images by solving for two unknown variables in the Jaffe–McGlamery [15,16] underwater imaging model: the transmission map and the background light. For example, the dark channel prior (DCP) algorithm proposed by He et al. [17,18] simplifies the Jaffe–McGlamery model by introducing the prior knowledge that the dark channel of a clear, fog-free image is close to zero. Variants of the DCP algorithm [19,20,21,22] have been developed and optimized over time, achieving better results. In recent years, convolutional neural networks have made remarkable achievements in fields such as image classification [23], object detection [24], and instance segmentation [25], and an increasing number of networks are being used to process underwater images. Some of these networks are end-to-end [26,27,28], outputting the recovered image directly from the original image, while others use deep learning to estimate some of the physical parameters of the underwater imaging model and then perform image restoration [29]. Learning-based methods perform well and are robust, but they are not suitable for situations with limited hardware resources.
Accurate pose estimation is a primary prerequisite for successful alignment. Pose estimation recovers the position and orientation of an object from the correspondence between its image and its features [30]; these features can be divided into three types: corners, lines, and ellipses (circles). Luckett et al. [31] compared the performance of these three features and found that the accuracy and precision of corner and line features increase as the distance decreases, but in high-noise environments, ellipse features are the most robust. To address the issue of ellipse detection accuracy, Zhang et al. [32] improved the circle-based ellipse detection method and designed a sub-pixel edge-based ellipse detection method, which improved detection accuracy, especially for incomplete ellipses, and was the first to show that improving the accuracy of ellipse edges helps to improve ellipse detection accuracy. Huang et al. [33] proposed a universal circle and point fusion framework that can solve pose estimation problems with various feature combinations, combining the advantages of both features with high accuracy and robustness. Meng et al. [30] proposed a perspective circle and line (PCL) method that uses the perspective view of a single circle and line to recover the position and orientation of an object, resolving the pose ambiguity and recovering the roll angle.
We propose an automatic alignment method for an underwater charging platform based on monocular vision recognition. After the UUV enters the underwater charging platform, the method accurately identifies its number and, using target recognition, guides the charging platform to move in the target direction by calculating the deviation between the current position of the target keypoints and the target position, thereby aligning the charging stake with the UUV’s charging port. The main contributions of this paper are as follows:
1. A single-camera visual-recognition-based UUV underwater alignment method is proposed that includes an encoding and decoding method for encrypted graphic targets, a method for determining the two-dimensional coordinates of the target location, and a target recognition algorithm, which can guide the charging stake on the charging platform to smoothly insert into the UUV’s charging port.
2. The method adapts to underwater environments and is robust to partial occlusion. Additionally, it requires fewer computational resources, lower hardware specifications, and shorter processing times, satisfying real-time control requirements, and its detection accuracy meets the requirements for smooth alignment.
The rest of this paper is organized as follows. Section 2 describes the proposed single-camera visual-recognition-based UUV underwater alignment method in detail. Section 3 presents the experimental results of the proposed method. In Section 4, we analyze the experimental results from Section 3 and describe the shortcomings of our method. Finally, Section 5 presents our conclusions.
2. Methods
The structure of the charging platform utilized in this method is illustrated in Figure 1, where both the camera and the charging stake are attached to the axial sliding table, which is fixed on the circumferential turntable. Meanwhile, the UUV is secured onto the alignment platform. Because of the positioning of the alignment platform, the camera’s distance from the UUV target at the target position is a known, fixed value. Consequently, the two-dimensional coordinates of the target’s keypoints on the UUV remain unchanged at the target position. By comparing the two-dimensional coordinates of the target keypoints at the current position and the target position, the direction of motion of the charging stake can be determined.
The proposed method comprises three stages, as shown in Figure 2. In the first stage, the UUV’s identity information is decoded to obtain its number, which serves as an index to retrieve the registered information of the UUV within the charging platform. This information includes the UUV’s charging voltage, size, and target spray position. Subsequently, the UUV is firmly clamped onto the alignment platform by the clamping device of the underwater charging platform. In the second stage, the UUV’s size and target spray position are obtained from the retrieved information, and the target position for docking is determined. This target position is the two-dimensional coordinate on the camera imaging plane at which the keypoints of the UUV’s target are located when the charging stake on the underwater charging platform can be inserted into the UUV’s charging port. In the third stage, the keypoints of the UUV’s target are recognized, and the charging stake is guided towards the target position by calculating the deviation between the current position and the target position. The stake is first aligned circumferentially and then axially until the distance between the current position and the target position falls within the allowable error range, as shown in Figure 3.
2.1. Encoding and Decoding
2.1.1. Encoding
Before UUVs can charge or exchange information with underwater charging platforms, their identity information should be determined to identify their model and charging voltage, confirm their mission type, and ensure secure information exchange. Therefore, a UUV identity information encryption and coding method is necessary to ensure information security. Firstly, the UUV number is expanded to three digits, with leading zeros added if necessary. From this three-digit number, four coding values are then obtained. These coding values are used to query their corresponding ArUco codes, which are graphic codes obtained by converting the numeric codes. Finally, the four ArUco codes obtained in the previous step are rotated clockwise by 0°, 90°, 180°, and 270°, respectively, and position information is thereby added to each coding value through the ArUco code’s pose information. The resulting encoding pattern is shown in Figure 4. This encoding method has some redundancy: when decoding, it is only necessary to recognize the position and ID of any two of the four coding values to infer the UUV number.
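To make the rotation-based redundancy concrete, the following Python sketch (using OpenCV’s aruco module) composes a 2 × 2 encoding pattern from four markers rotated clockwise by 0°, 90°, 180°, and 270°. The mapping from the UUV number to the four coding values is not reproduced here, so the number_to_ids function, the dictionary choice (DICT_4X4_50), and the marker size are purely illustrative assumptions.

```python
import cv2
import numpy as np

ARUCO_DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def number_to_ids(uuv_number: int):
    # Hypothetical mapping: the paper's actual coding rule is not reproduced here.
    digits = f"{uuv_number:03d}"
    return [int(d) for d in digits] + [sum(int(d) for d in digits) % 50]

def build_target(uuv_number: int, cell_px: int = 200) -> np.ndarray:
    """Compose the 2x2 encoding pattern: each marker is rotated clockwise by
    0/90/180/270 degrees so its orientation also carries position information."""
    markers = []
    for k, marker_id in enumerate(number_to_ids(uuv_number)):
        # On OpenCV < 4.7 use cv2.aruco.drawMarker instead.
        img = cv2.aruco.generateImageMarker(ARUCO_DICT, marker_id, cell_px)
        img = np.rot90(img, -k)  # clockwise rotation by k * 90 degrees
        markers.append(img)
    top = np.hstack(markers[:2])
    bottom = np.hstack(markers[2:])
    return np.vstack([top, bottom])

if __name__ == "__main__":
    cv2.imwrite("uuv_123_target.png", build_target(123))
```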
2.1.2. Decoding
Due to the challenging imaging conditions, the ArUco codes in the original images cannot be recognized directly. Therefore, this paper proposes a method for ArUco code detection that involves image restoration, thresholding, and template filling to reconstruct the ArUco codes in the respective regions. The specific steps are illustrated in Figure 5.
In the first step, the original image is subjected to image restoration. The core of the image restoration method is to solve for two unknowns, the transmission map and the background light, in the underwater imaging model represented by Equation (1). We adopt the method proposed in [19] and use the difference between the red channel and the maximum of the blue and green channels to estimate the transmission of the red channel. The transmission maps of the blue and green channels are then obtained based on statistical analysis [34], as shown in Equation (2). Furthermore, we apply the gray world assumption theory [9] to estimate the background light, as shown in Equation (3). By substituting the calculated transmission map and background light into Equation (1), the restored image can be obtained.
where I(x) is the original image, J(x) is the restored image, t(x) is the transmission map, B is the background light, and λ is the wavelength of light.
where Ω(x) denotes a local patch in the image.
where the constant on the right-hand side of Equation (3) represents the desired mean gray value of the restored image and is set to 0.5 in this paper.
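The following NumPy sketch illustrates this restoration step under the commonly used simplified imaging model I = J·t + B·(1 − t). The red-channel transmission is estimated from the gap between the red channel and the maximum of the blue and green channels, and the background light is estimated in a gray-world fashion; the channel attenuation exponents, the clipping bounds, and the final rescaling to a mean gray value of 0.5 are illustrative choices and do not reproduce Equations (1)–(3) exactly.

```python
import numpy as np

def restore_underwater_image(img, mean_gray=0.5, t_min=0.1):
    """Illustrative restoration under a simplified underwater imaging model.
    img: float32 RGB image with values in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]

    # Red-channel transmission from the gap between red and max(green, blue);
    # the linear mapping below is an assumption, not the paper's Equation (2).
    t_r = np.clip(1.0 - (np.maximum(g, b) - r), t_min, 1.0)

    # Blue/green transmissions tied to the red one (illustrative exponents;
    # the paper derives the relationship from statistical analysis [34]).
    t_g = np.clip(t_r ** 0.9, t_min, 1.0)
    t_b = np.clip(t_r ** 0.8, t_min, 1.0)
    t = np.stack([t_r, t_g, t_b], axis=-1)

    # Gray-world-style estimate of the background light (per-channel mean).
    B = img.reshape(-1, 3).mean(axis=0)

    # Invert the imaging model I = J * t + B * (1 - t).
    J = np.clip((img - B * (1.0 - t)) / t, 0.0, 1.0)

    # Rescale so the restored image has the desired mean gray value (0.5 here).
    return np.clip(J * (mean_gray / max(J.mean(), 1e-6)), 0.0, 1.0)
```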
In the second step, the restored image is subjected to a thresholding operation. Although the contrast of the restored image has been improved, it still does not meet the recognition criteria for ArUco codes. Traditional methods utilize contrast enhancement and image binarization to assist ArUco code recognition; however, these operations amplify image noise and cause ArUco code recognition to fail. Therefore, this paper proposes an improved local thresholding approach, as shown in Equation (4). The method creates a small window and a large window around each pixel in the image, compares the mode of the pixel grayscale values in the small window with the mean grayscale value of the pixels in the large window, and returns a new grayscale value for the center pixel. Compared with traditional binarization methods, this approach has two advantages: using the mode of the grayscale values in the small window centered on each pixel helps suppress noise, and categorizing all pixels into five classes instead of two (0 and 255) produces smoother transitions between grayscale values, which enhances the robustness of the ArUco code recovery in the third step.
where the small and large windows are sliding windows with fixed radii, as shown in Figure 6; in this paper, a radius of 10 is used.
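A minimal sketch of this two-window idea is given below. For speed it uses the local mean of the small window as a stand-in for the mode, and the window radii, comparison thresholds, and five output levels are illustrative assumptions rather than the exact values of Equation (4).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def five_level_threshold(gray, r_small=3, r_large=10,
                         levels=(0, 64, 128, 192, 255)):
    """Compare a small-window statistic with the large-window mean and
    quantize each pixel into five grayscale classes."""
    gray = gray.astype(np.float32)
    small_mean = uniform_filter(gray, size=2 * r_small + 1)  # stand-in for the mode
    large_mean = uniform_filter(gray, size=2 * r_large + 1)

    # Signed difference between the small-window value and the large-window
    # mean decides how dark or bright the output pixel becomes.
    diff = small_mean - large_mean
    out = np.full(gray.shape, levels[2], dtype=np.uint8)
    out[diff < -20] = levels[0]
    out[(diff >= -20) & (diff < -5)] = levels[1]
    out[(diff > 5) & (diff <= 20)] = levels[3]
    out[diff > 20] = levels[4]
    return out
```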
In the last step, the ArUco codes are reconstructed. After obtaining the thresholded image, the regions of interest (ROIs) containing the ArUco codes can be determined. Each ROI is then divided into 36 equally sized rectangles, with the central 16 rectangles containing the encoding information of the ArUco codes. The color of the corresponding blank positions in the template is determined based on the average grayscale value within each rectangle. By filling in the template, the reconstructed ArUco codes are obtained. Finally, the ID and angle of the reconstructed ArUco codes are identified.
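The template-filling step can be sketched as follows: the thresholded ROI is resized to a 6 × 6 grid of cells, each cell is filled black or white according to its mean grayscale value, and the resulting clean marker is identified with OpenCV’s ArUco detector. The dictionary (DICT_4X4_50), cell size, and binarization threshold are assumptions for illustration.

```python
import cv2
import numpy as np

ARUCO_DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def reconstruct_and_identify(roi, cell_px=20):
    """roi: grayscale image of one thresholded ArUco region (payload + border).
    Returns (ids, corners) found in the reconstructed, noise-free marker."""
    grid = 6  # 4x4 payload plus a one-cell black border
    roi = cv2.resize(roi, (grid * cell_px, grid * cell_px))

    clean = np.zeros_like(roi)
    for i in range(grid):
        for j in range(grid):
            cell = roi[i * cell_px:(i + 1) * cell_px, j * cell_px:(j + 1) * cell_px]
            # Fill the template cell white if its mean grayscale value is high.
            if cell.mean() > 127:
                clean[i * cell_px:(i + 1) * cell_px, j * cell_px:(j + 1) * cell_px] = 255

    # White margin so the detector can find the marker's outer contour.
    clean = cv2.copyMakeBorder(clean, cell_px, cell_px, cell_px, cell_px,
                               cv2.BORDER_CONSTANT, value=255)
    # On OpenCV >= 4.7, cv2.aruco.ArucoDetector(ARUCO_DICT).detectMarkers(clean)
    # can be used instead.
    corners, ids, _ = cv2.aruco.detectMarkers(clean, ARUCO_DICT)
    return ids, corners
```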
2.2. Determination of Target Position
Due to the complex underwater environment, fixing all UUVs to a position that allows the charging stake to be smoothly inserted into their charging port and recording the two-dimensional coordinates of the current target keypoints would require a lot of manpower and resources. Therefore, this paper proposes a method to determine the two-dimensional coordinates of the target keypoints underwater based on the target spraying positions on the UUV. This method first determines the above-water coordinates of the target position based on the target spraying position and then uses the law of refraction to determine the underwater coordinates.
2.2.1. Above-Water Coordinates of the Target Position
The schematic diagram of the two-dimensional coordinates of the above-water target position is shown in Figure 7. The two-dimensional coordinates of the above-water target position are the coordinates of the keypoint of the target T in the o-uv coordinate system. During target spraying on the UUV, the relative position between the keypoint of the target and the UUV charging port can be obtained, that is, the coordinates (X, Y, Z) of the keypoints of the target in the coordinate system O-XYZ. Since the target position of the underwater charging platform is the position at which the charging stake can be inserted into the charging port of the UUV, and since the clamping device of the charging platform fixes and moves the UUV to a specific position, the coordinates of the keypoint of the target in the world coordinate system are (X, Y, Z + L), where the value of L is determined by the type of UUV and can be obtained during the decoding process. The process of converting coordinates in the world coordinate system to coordinates in the image coordinate system o-uv can be regarded as the camera calibration process. The conversion is shown in Equation (5):
where the camera’s extrinsic matrix represents the positional relationship between the world coordinate system and the camera coordinate system, and the camera’s intrinsic matrix represents the transformation between the camera coordinate system and the image coordinate system o-uv. The intrinsic parameters include the scaling factors in the x and y directions, the camera’s focal length f, and the principal point, i.e., the coordinate of the camera’s optical center in the image.
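For reference, a standard form of this projection, written here in notation of our own choosing (the paper’s exact symbols are not reproduced), is:

```latex
Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\underbrace{\begin{bmatrix} f s_x & 0 & u_0 \\ 0 & f s_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic matrix}}
\underbrace{\begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix}}_{\text{extrinsic matrix}}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
```

Here f is the focal length, s_x and s_y are the scaling factors, (u_0, v_0) is the principal point, R and t are the rotation and translation of the extrinsic matrix, and Z_c is the depth of the point in the camera coordinate system.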
The camera’s intrinsic parameters can be obtained and image distortion corrected using Zhang’s calibration method [35]. However, due to camera installation errors, the camera’s extrinsic matrix must be determined more accurately through further hand–eye calibration [36].
Inspired by neural network concepts, this paper transforms the camera calibration problem into estimating the function f in Equation (6). To achieve this, the slider is moved under control and 25 photos are captured, as shown in Figure 8. For each photo, the coordinates of the four concentric circle centers in the world coordinate system and their corresponding coordinates in the image coordinate system are recorded. This yields 100 samples, with the world coordinates serving as data and the image coordinates as labels. Fifty percent of the samples were used for training and the remaining fifty percent for testing. A single-hidden-layer neural network without activation functions was constructed, as shown in Figure 9. The mean squared error (MSE) loss function was employed; the extrinsic matrix estimate was used to initialize the first layer of the network, while the calibrated intrinsic parameters were used to initialize the second layer. The initialized network converged quickly, with small oscillations, and ultimately reached a smaller loss value.
After training, the network weights consist of two matrices with dimensions of 4 × 3 and 3 × 3, respectively. Given a coordinate in the world coordinate system, its corresponding coordinate in the image coordinate system can be calculated using Equation (7). It is worth noting that the two matrices obtained from training do not have physical meaning, and the intermediate variables in Equation (7) are not the coordinates of the point in the camera coordinate system.
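A minimal PyTorch sketch of such a calibration network, under our own assumptions about layer shapes and training details, is shown below: two linear layers without activation map a homogeneous world point to a 3-vector, a perspective division yields (u, v), the layers are initialized from the extrinsic estimate and the calibrated intrinsics, and the network is trained with an MSE loss.

```python
import torch
import torch.nn as nn

class CalibNet(nn.Module):
    """Two linear layers without activation: homogeneous world point (X, Y, Z, 1)
    -> 3-vector -> 3-vector, followed by perspective division to get (u, v)."""
    def __init__(self, extrinsic_init, intrinsic_init):
        # extrinsic_init: 3x4 tensor ([R|t] estimate from hand-eye calibration);
        # intrinsic_init: 3x3 tensor (intrinsics from Zhang's method).
        super().__init__()
        self.l1 = nn.Linear(4, 3, bias=False)  # plays the role of the 4x3 matrix
        self.l2 = nn.Linear(3, 3, bias=False)  # plays the role of the 3x3 matrix
        with torch.no_grad():
            self.l1.weight.copy_(extrinsic_init)  # nn.Linear stores it as 3x4
            self.l2.weight.copy_(intrinsic_init)

    def forward(self, xyz1):
        p = self.l2(self.l1(xyz1))
        return p[:, :2] / p[:, 2:3]  # (u, v) after perspective division

def train(model, xyz1, uv, epochs=2000, lr=1e-3):
    """World points (N x 4, homogeneous) as data, pixel coordinates (N x 2) as labels."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(xyz1), uv)
        loss.backward()
        opt.step()
    return model
```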
2.2.2. Underwater Coordinates of the Target Position
The imaging principle of the underwater camera is shown in Figure 10. In air, the light reflected by the target propagates in a straight line, and an object of size h is projected onto the camera imaging plane with size a. Underwater, however, owing to the refraction of light between different media, an object of size h is projected onto the imaging plane with size b. According to Snell’s law [37], as shown in Equation (8), since the refractive index of water is 1.333, b is larger than a.
where θ1 represents the angle of incidence, θ2 represents the angle of refraction, and n represents the refractive index of the medium.
It can be inferred that the projection of an underwater object on the imaging plane can be obtained by magnifying its above-water projection, with the camera center as the projection center, by a certain factor. The magnification factor can be obtained from Equation (9).
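Written out with symbols of our own choosing (Equations (8) and (9) themselves are not reproduced here), Snell’s law and the resulting paraxial magnification for a flat-port underwater camera are approximately:

```latex
n_{\text{water}} \sin\theta_1 = n_{\text{air}} \sin\theta_2,
\qquad
k = \frac{b}{a} \approx \frac{\tan\theta_2}{\tan\theta_1}
\approx \frac{n_{\text{water}}}{n_{\text{air}}} \approx 1.333
```

The last approximation holds for rays close to the optical axis; farther from the axis the magnification deviates from 1.333.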
2.3. Target Recognition and Instruction Provision
This method utilizes a target designed by Tweddle et al. [38], as shown in Figure 11. The target consists of four concentric circles with area ratios of 1.44, 1.78, 2.25, and 2.94, and the keypoints of the target are the centers of the four concentric circles. During the alignment process, the camera image is first preprocessed, and contour detection then identifies circles that may originate from the target according to their area and roundness. Next, the concentric circles are matched based on the area ratios, and the coordinates of each concentric circle center are obtained. Finally, based on the relative position between a visible concentric circle center at the current position and the corresponding concentric circle center at the target position, the charging stake is guided to move until the coordinate difference between the current and target positions is less than the maximum allowable error, at which point its motion is stopped. During alignment, the x-coordinate of the concentric circle center is compared first to guide the charging stake to rotate tangentially for azimuthal alignment, followed by the y-coordinate to guide the charging stake to move axially for axial alignment.
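The contour-based detection and ratio matching can be sketched with OpenCV as follows; the area, roundness, and ratio tolerances are illustrative assumptions, and find_target_keypoints is a hypothetical helper rather than the paper’s implementation.

```python
import cv2
import numpy as np

AREA_RATIOS = (1.44, 1.78, 2.25, 2.94)  # area ratios of the four concentric circles

def find_target_keypoints(binary, min_area=50.0, min_roundness=0.8, center_tol=5.0):
    """Detect candidate circles and group them into concentric pairs whose
    area ratio matches one of the target's known ratios."""
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    circles = []
    for c in contours:
        area = cv2.contourArea(c)
        perim = cv2.arcLength(c, True)
        if area < min_area or perim == 0:
            continue
        roundness = 4.0 * np.pi * area / (perim * perim)  # 1.0 for a perfect circle
        if roundness < min_roundness:
            continue
        (x, y), _ = cv2.minEnclosingCircle(c)
        circles.append((x, y, area))

    keypoints = {}
    for xo, yo, ao in circles:
        for xi, yi, ai in circles:
            # Concentric pair: larger outer circle with a nearby inner center.
            if ao <= ai or np.hypot(xo - xi, yo - yi) > center_tol:
                continue
            ratio = ao / ai
            for label, target_ratio in enumerate(AREA_RATIOS):
                if abs(ratio - target_ratio) < 0.1:  # tolerance is an assumption
                    keypoints[label] = ((xo + xi) / 2.0, (yo + yi) / 2.0)
    return keypoints  # {circle index: (u, v) center coordinate}
```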
To address the problem of low contrast in underwater images, this paper employs the Niblack binary thresholding method [39]. It calculates a pixel threshold by sliding a rectangular window over the grayscale image [40], as shown in Equation (10). If the pixel value surpasses the threshold, the pixel is set as foreground; otherwise, it is set as background. Because the Niblack algorithm adaptively adjusts the threshold based on local blocks, it preserves more texture details in the image; however, it requires considerable computational resources and time to process an image.
where T represents the pixel threshold, n represents the number of pixels in the sliding window, the grayscale values of the individual points in the window determine the local statistics, m represents the mean grayscale value of all points in the window, and k represents the correction parameter.
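For reference, the standard form of the Niblack threshold, which uses the quantities listed above (with p_i denoting the grayscale values of the points in the window), is:

```latex
T = m + k \cdot s,
\qquad
m = \frac{1}{n}\sum_{i=1}^{n} p_i,
\qquad
s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - m\right)^{2}}
```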
To accelerate Niblack thresholding, this paper draws inspiration from the ResNet network [41] and adopts a structure similar to its bottleneck architecture, which first reduces and then restores the size of the image. Because the processing time of Niblack thresholding is directly proportional to the image size, the image is first reduced to one quarter of its original size before Niblack thresholding is performed, and the thresholded image is then enlarged fourfold to restore its original size. Although this image has the same size as the original, its information content is only one quarter of that of the original image, and its contour details are relatively blurred; using it directly for contour detection would lower the detection accuracy. Therefore, inspired by the coarse-to-fine idea of the LoFTR algorithm [42], this paper uses the downscaled result only to determine the region of interest (ROI) and then performs Niblack thresholding on the ROI of the original image. This significantly reduces the number of pixels processed by Niblack thresholding, ensuring the real-time performance of the algorithm.
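A coarse-to-fine sketch of this acceleration, using scikit-image’s threshold_niblack, is given below; the scaling factor, window size, and ROI padding are illustrative assumptions rather than the paper’s exact settings.

```python
import cv2
import numpy as np
from skimage.filters import threshold_niblack

def coarse_to_fine_niblack(gray, scale=0.5, window=25, margin=10):
    """Locate the ROI on a downscaled copy, then apply Niblack thresholding
    only to that ROI of the full-resolution image."""
    small = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    coarse = small > threshold_niblack(small, window_size=window)

    ys, xs = np.nonzero(coarse)
    if xs.size == 0:
        return np.zeros_like(gray)

    # Map the coarse foreground bounding box back to full resolution, padded.
    y0 = max(int(ys.min() / scale) - margin, 0)
    y1 = min(int(ys.max() / scale) + margin, gray.shape[0])
    x0 = max(int(xs.min() / scale) - margin, 0)
    x1 = min(int(xs.max() / scale) + margin, gray.shape[1])

    out = np.zeros_like(gray)
    roi = gray[y0:y1, x0:x1]
    out[y0:y1, x0:x1] = 255 * (roi > threshold_niblack(roi, window_size=window))
    return out
```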
To address partial occlusion caused by bubbles and suspended particles in underwater images, this paper enhances the robustness of the algorithm by exploiting the redundancy of the target information. During the alignment process, only one of the four concentric circles in the target needs to be identified; the two-dimensional coordinates of its center are then compared with the center of the concentric circle with the corresponding area ratio at the target position, providing motion instructions in both the circumferential and axial directions. If multiple concentric circle centers are detected, the centroid of the detected circle centers is used for the comparison.
4. Discussion
The experimental results of the underwater encoding and decoding scheme are presented in Figure 12. Although some ArUco codes were not recognized due to occlusion, two ArUco codes were identified, with IDs 31 and 6 and yaw angles of 179° and −95°, respectively. From the estimated marker poses, the coding values represented by ID 31 and ID 6 can be inferred, and decoding yields the UUV number 123.
As shown in Figure 13, the original image suffers from low contrast, blurry contours, and a color shift towards blue and green due to light absorption and scattering in the underwater environment. The method based on the gray world assumption theory [43] corrects the color shift but fails to enhance the image contrast. The UDCP method [44] intensifies the color shift towards green. The Shallow-UWnet method [28] enhances the image contrast but introduces an additional color shift towards yellow. In contrast, the method proposed in this paper corrects the color shift while enhancing image details. As shown in Table 1, by applying the proposed image restoration and thresholding methods, eight out of nine ArUco codes can be detected. When only the proposed image restoration method is applied, at most four codes can be detected, while applying only the proposed image thresholding method detects up to six codes. The best result among the remaining methods is achieved by combining the Niblack method with the gray world assumption theory, which detects five codes. Therefore, both the proposed image restoration method and the proposed thresholding method effectively improve the detection rate of ArUco codes.
As shown in Figure 14, Cao’s method [46] exhibits a fast convergence rate during the initial stages of training but shows significant oscillations during the convergence process. In contrast, the method proposed in this paper converges more slowly but demonstrates a stable convergence process, with the final loss value consistently reaching a smaller value. The minimum testing error achieved by Cao’s method [46] is 0.3828 pixels, while the proposed method achieves a minimum training error of 0.0036 pixels.
Figure 15b presents the difference between the predicted and actual underwater target positions; the predicted and actual underwater keypoints overlap almost perfectly. The errors are shown in Table 2, with a maximum error of 5.21 pixels, which meets the accuracy requirements for alignment.
Table 3 shows the average processing time per frame under different scaling ratios, and Figure 16 shows the data fluctuation under different scaling ratios. The algorithm proposed in this paper produces the same processing results as detection without scaling and exhibits minimal data fluctuation, while its processing time is only 37.41% of the original, meeting the real-time requirements of the alignment control process.
The displacement changes of each hydraulic cylinder during the alignment process are shown in Figure 17, Figure 18 and Figure 19. Finally, the hydraulic cylinder drives the piston rod upward by 119 mm, indicating that the piston rod has been successfully inserted into the hole on the UUV. As shown in Figure 20 and Figure 21, the error between the final alignment position and the target position is −0.69661 pixels in the x direction and −0.58738 pixels in the y direction. Therefore, considering the calibration errors in Figure 14, the underwater coordinate transformation errors in Table 2, and the motion errors in Figure 20 and Figure 21, the maximum alignment error is 4.517 pixels. Based on the field of view (FOV) calculated using Equation (12), the maximum alignment error corresponds to 0.5548 mm, which meets the accuracy requirements.
Although the method proposed in this paper has successfully achieved the automatic alignment of the UUV underwater charging platform, there is still room for improvement. Firstly, for partially occluded targets, this paper uses redundant information processing methods in both the decoding and target recognition processes. However, accurately completing the occluded parts in real time would further enhance the robustness of the method. Secondly, in determining the camera’s intrinsic and extrinsic parameters, this paper still needs to use Zhang’s calibration method to determine the distortion coefficients and initialize the network parameters using the calibrated intrinsic parameters. This method is not concise enough. Finally, this paper did not explicitly express the camera’s intrinsic and extrinsic parameters. These three points will be our future research directions.