1. Introduction
The liquefied natural gas (LNG) manipulator arm is critical equipment for loading and unloading petroleum and other fluids at ports. One of the daily tasks of dock workers is to connect the LNG manipulator arm with the pipeline interface located on the ship. However, the traditional manual towing methods are not only time-consuming and labor-intensive but also pose various safety hazards. For instance, during the COVID-19 pandemic, frequent contact with foreign ships increased the risk of infection. In addition, as the population ages, labor costs continue to rise. Therefore, realizing unmanned docking control of the LNG manipulator arm improves both the safety and the efficiency of port production.
Compared with manual methods, machine vision technologies are highly efficient and do not require physical contact. Moreover, compared with traditional surveying and mapping methods, machine vision technologies are more flexible because fewer steps are required for setting markers. Che et al. [1] proposed a high-precision, contactless method for measuring part assembly clearances by performing image processing based on machine vision. Gu et al. [2] set up a visual measurement system to obtain the wear characteristics of drill bits and evaluate their wear condition. Millara et al. [3] built a system based on laser emitters and cameras for dimensional quality inspection during the manufacturing of railway tracks. Li et al. [4] proposed a turning control algorithm based on monocular vision vehicle turning path prediction; the algorithm estimates the cornering trajectory of a vehicle based on the vehicle's front axle length and front-wheel adjustment data.
Machine vision detection technologies have been applied in various fields, such as agriculture, manufacturing, and medicine. Abhilash et al. [5] developed a closed-loop machine vision system for wire electrical discharge machining (EDM) process control, which successfully predicts wire breakage by monitoring the severity of wire wear. Lin et al. [6] used machine vision to inspect the surface quality of fluff fabric, detecting fabric defects and pilling by establishing qualitative and quantitative evaluation models. Ropelewska et al. [7] converted images of beetroots to different color channels and used models based on textures selected for each channel to discriminate raw from processed beetroots. Nawar et al. [8] proposed a prototype based on color segmentation with an SVM classifier to recognize skin problems quickly and at low cost. Sung et al. [9] developed an automatic grader for flatfish using machine vision. Keenan et al. [10] proposed an objective grading system using automated machine vision to address the fact that the histological grading of cervical intraepithelial neoplasia (CIN) is subjective and poorly reproducible. Liu et al. [11] established a semantic segmentation network applied to scene recognition in warehouse environments. Yang et al. [12] proposed a method for detecting bubble defects on tire surfaces based on line lasers and machine vision to eliminate the driving hazards caused by such bubbles. Lin et al. [13] used machine vision and artificial intelligence algorithms to rapidly check the degree of cooking of foods and avoid overcooking. Im et al. [14] proposed a large-scale Object-Defect Inspection System based on a Regional Convolutional Neural Network (R-CNN; RODIS) to detect defects on car side-outer panels.
In addition to identifying the target pipeline interface, docking also requires measuring the distance and the relative pose between the target and the manipulator arm. Therefore, binocular stereo vision is needed to obtain the depth information of the target. Stereo vision technology has a wide range of applications in many fields. Tuan et al. [15] proposed an approach based on deep learning and stereo vision to determine concrete slump in outdoor environments. Afzaal et al. [16] developed a stereo-vision-based method for estimating the roughness of agricultural soils, investigating tillage quality by analyzing the height of plow layers. Kardovskyi et al. [17] developed an artificial intelligence-based quality inspection model (AI-QIM) that uses the mask region-based convolutional neural network (Mask R-CNN) to perform instance segmentation of steel bars and generates information on steel bar installation by stereo vision. Gunatilake et al. [18] proposed a mobile robotic sensing system that can scan, detect, locate, and measure internal pipeline defects by generating three-dimensional RGB depth maps from a stereo camera combined with an infrared laser profiling unit.
After stereo matching produces a point cloud, the cloud often needs further processing according to the task requirements; in this work, plane fitting is required. Data analysis methods such as principal component analysis (PCA) [19], independent component analysis (ICA) [20], and support vector machines (SVM) [21,22] are reviewed in [23]. Li et al. [24] developed an accurate plane fitting method for a structured light (SL)-based RGB-D sensor and proposed a new cost function for a RANSAC-based plane fitting method. Hamzah et al. [25] improved the disparity map refinement stage by using an adaptive least squares plane fitting technique to increase the accuracy of the final stage of a stereo matching algorithm. Yu et al. [26] proposed a cylinder fitting method for incomplete three-dimensional point cloud data based on cutting planes, addressing the influence of occlusion and noise. Kermarrec et al. [27] used the residuals of least-squares surface approximations of TLS point clouds to quantify the temporal correlation of TLS range observations.
The identification and positioning of the petroleum pipeline interface and the judgment of its orientation are the primary links in accomplishing automatic docking of the manipulator arm, and the most important ones in the entire docking process. This work explores an efficient, accurate, and flexible method based on binocular vision for identifying the petroleum pipeline interface and detecting its orientation according to the requirements of manipulator arm docking. The images are preprocessed, and the region of the petroleum pipeline interface is extracted by transforming the color space and thresholding the image channels. The 2D coordinates of the interface's inner contour and center are obtained with the findContours function available in OpenCV. The 3D coordinates of the interface's contour are then obtained by binocular stereo vision, and a plane fitting algorithm based on the RANSAC algorithm [28,29] is used to fit them. The RANSAC-based algorithm performs better than the least-squares-based methods of [25,27] in the presence of abnormal samples. Finally, from the fitted space plane equation, the orientation of the petroleum pipeline interface (the normal vector of the fitted plane) and the 3D coordinates of the interface's center are obtained. In this work, machine vision is thus used to automatically detect the position and orientation of the petroleum pipeline interface, which provides the basic data for the subsequent automatic docking of the LNG manipulator arm. Vongbunyong et al. [30] presented a precise docking marker technique for robot localization based on a Lidar sensor and geometrical markers. Compared with the method of [30], methods based on machine vision are more flexible and do not require special markers.
4. Stereo Vision Ranging System
4.1. The Principle of Target 3D Coordinate Acquisition
In order to obtain the 3D coordinates by using binocular stereo vision, two cameras are first used to capture the same object from different positions to obtain two images. Then, the imaging deviation of the target feature points on the two images is calculated based on the principle of stereo matching and similar triangles. Consequently, the depth information of the feature points is obtained, and the 3D space coordinates are calculated. The principle of binocular stereo vision is shown in Figure 7.
In Figure 7, f denotes the focal length of the left and right cameras, OL and OR are the optical centers of the left and right cameras, respectively, and OLZL and ORZR are their optical axes. In the ideal binocular stereo vision model, the optical axes are parallel, and b represents the distance between them (the baseline). The coordinates of point P in the coordinate system XLOLYLZL, with the left camera optical center as the origin, are P(Xc, Yc, Zc). XlOlYl is the imaging plane of the left camera, and XrOrYr is the imaging plane of the right camera. The imaging points of P on the two imaging planes are p1(x1, y1) and p2(x2, y2), respectively.
According to the principle of similar triangles, p1(x1, y1), p2(x2, y2), and P(Xc, Yc, Zc) are related as follows:

$$x_1 = \frac{f X_c}{Z_c}, \qquad y_1 = \frac{f Y_c}{Z_c} \tag{1}$$

$$x_2 = \frac{f\,(X_c - b)}{Z_c}, \qquad y_2 = \frac{f Y_c}{Z_c} \tag{2}$$

The disparity is defined as the horizontal imaging deviation between the two views:

$$d = x_1 - x_2 = \frac{f\,b}{Z_c} \tag{3}$$

Solving (1)–(3) for the coordinates of P gives

$$X_c = \frac{b\,x_1}{d}, \qquad Y_c = \frac{b\,y_1}{d}, \qquad Z_c = \frac{f\,b}{d} \tag{4}$$

It is evident from this expression that the key to obtaining the 3D coordinates of the target in binocular stereo vision is to obtain the disparity value.
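As a minimal sketch of Equation (4) in C++ (the implementation language used in Section 5), the following function recovers the camera-frame coordinates of a feature point from its left-image coordinates and disparity. The function name is ours, and it assumes that f, b, x1, and y1 are expressed in consistent units:

```cpp
#include <opencv2/core.hpp>

// Numerical form of Equation (4): recover the camera-frame coordinates of a
// feature point from its left-image coordinates (x1, y1) and disparity d.
// f and b must come from calibration (Section 4.2).
cv::Point3d pointFromDisparity(double x1, double y1, double d, double f, double b)
{
    const double Zc = f * b / d;   // depth from disparity, Eq. (3)/(4)
    return { b * x1 / d,           // Xc
             b * y1 / d,           // Yc
             Zc };
}
```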
4.2. Camera Calibration
Three transformations between four coordinate systems are required to transform 2D coordinates into 3D coordinates. These four coordinate systems are the image pixel coordinate system (uOpv), the image physical coordinate system (xOy), the camera coordinate system (XcOcYcZc), and the world coordinate system (XwOwYwZw).
As shown in Figure 8, the image pixel coordinate system takes the upper left corner of the image as its origin. (u0, v0) represents the coordinates of the origin of the image physical coordinate system in the image pixel coordinate system.
The image pixel coordinate system is transformed into the image physical coordinate system by the following expression:

$$u = \frac{x}{dx} + u_0, \qquad v = \frac{y}{dy} + v_0 \tag{5}$$

where (u, v) represents the coordinates of the target point in the pixel coordinate system, (u0, v0) represents the coordinates of the origin of the image physical coordinate system in the pixel coordinate system, and dx and dy are the physical dimensions of a single pixel in the x and y directions, respectively.
According to the principle of obtaining the 3D coordinates of the target by binocular stereo vision, the transformation from the image physical coordinate system to the camera coordinate system follows from (1):

$$Z_c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \tag{6}$$

where f is the focal length of the camera, (Xc, Yc, Zc) are the coordinates of the target point in the left camera coordinate system, and (x, y) are its coordinates in the image physical coordinate system.
Both the world coordinate system and the camera coordinate system are 3D coordinate systems, and the transformation between two 3D coordinate systems can be expressed as a rotation and a translation. Therefore, the transformation between the world coordinate system and the camera coordinate system is realized by the following expression:

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R_{3\times3} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T_{3\times1} \tag{7}$$

where R3×3 is the rotation matrix of the camera coordinate system relative to the world coordinate system, T3×1 is the translation vector of the camera coordinate system relative to the world coordinate system, (Xc, Yc, Zc) are the coordinates of the target point in the left camera coordinate system, and (Xw, Yw, Zw) are its coordinates in the world coordinate system.
Combining Formulas (5)–(7) gives the transformation from the world coordinate system to the pixel coordinate system:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R_{3\times3} & T_{3\times1} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \tag{8}$$

where $f_x = f/dx$ and $f_y = f/dy$.
The purpose of camera calibration is to obtain the unknown parameters of (8). Among these parameters, fx, fy, u0, and v0 are the internal parameters of the camera. Therefore, the matrix where these parameters are located is called the internal parameter matrix. The matrix composed of the rotation matrix R3×3 and the translation vector T3×1 is called the external parameter matrix.
The camera calibration experiment performed in this work uses MATLAB 2017b. The calibration board used in the experiment is a checkerboard with a grid side length of 25 mm, as shown in Figure 9.
In this work, the method proposed by Zhang [40] is used to perform the camera calibration. This method calibrates the camera by detecting the corners of a checkerboard. It has low equipment requirements, needing only a printed checkerboard, which simplifies the calibration process while guaranteeing the accuracy of the calibration.
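Although the calibration in this work is performed in MATLAB, an equivalent sketch of Zhang's method in the C++/OpenCV environment used elsewhere in the paper is given below. The board layout (9 × 6 inner corners) and all function and variable names are illustrative assumptions, not the configuration used in this work:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch of Zhang's checkerboard calibration with OpenCV. `grays` holds
// 8-bit grayscale views of the board; K and dist receive the intrinsic
// matrix and distortion coefficients. Returns the RMS reprojection error.
double calibrateFromImages(const std::vector<cv::Mat>& grays, cv::Mat& K, cv::Mat& dist)
{
    const cv::Size boardSize(9, 6);   // inner corners per row/column (assumed layout)
    const float squareSize = 25.0f;   // grid side length in mm, as in Figure 9

    // Planar model of the board: Z = 0 for every corner.
    std::vector<cv::Point3f> board;
    for (int r = 0; r < boardSize.height; ++r)
        for (int c = 0; c < boardSize.width; ++c)
            board.emplace_back(c * squareSize, r * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objPoints;
    std::vector<std::vector<cv::Point2f>> imgPoints;
    for (const cv::Mat& img : grays) {
        std::vector<cv::Point2f> corners;
        if (!cv::findChessboardCorners(img, boardSize, corners)) continue;
        cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
        objPoints.push_back(board);
        imgPoints.push_back(corners);
    }
    std::vector<cv::Mat> rvecs, tvecs;
    return cv::calibrateCamera(objPoints, imgPoints, grays.front().size(),
                               K, dist, rvecs, tvecs);
}
```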
The images of the calibration board acquired by the binocular camera are shown in Figure 10 and Figure 11. The camera calibration results are shown in Table 1.
4.3. Distortion Correction and Stereo Correction
The camera projects objects onto the imaging plane according to the projection principle. During this process, limitations in the manufacturing and assembly accuracy of the lens lead to distortions in the image, which in turn affect the detection and recognition results. Therefore, it is necessary to remove these distortions. Lens distortion can be divided into radial distortion and tangential distortion, which need to be corrected separately.
Radial distortion is caused by light rays bending more at the edge of the lens than near its center; the distortion is directed along the radius of the lens. The radial distortion model is given in (9). Tangential distortion is caused when the lens is not parallel to the imaging plane of the camera; its model is given in (10).

$$\begin{cases} x = x_0\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \\ y = y_0\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \end{cases} \tag{9}$$

$$\begin{cases} x = x_0 + \big[\,2 p_1 x_0 y_0 + p_2\,(r^2 + 2 x_0^2)\,\big] \\ y = y_0 + \big[\,p_1\,(r^2 + 2 y_0^2) + 2 p_2 x_0 y_0\,\big] \end{cases} \tag{10}$$
where (x0, y0) denotes the original position of the distorted point on the image plane, (x, y) is the new position after the distortion is corrected, $r = \sqrt{x^2 + y^2}$ is the distance from the new position to the center of the lens, k1, k2, and k3 are the radial distortion parameters, and p1 and p2 are the tangential distortion parameters.
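In the C++/OpenCV environment described in Section 5, the parameters of models (9) and (10) map directly onto cv::undistort. The short sketch below is only an illustration of this mapping; the wrapper name is ours:

```cpp
#include <opencv2/opencv.hpp>

// Applying models (9) and (10): cv::undistort uses the same parameter set
// (k1, k2, k3, p1, p2) estimated during calibration. Note that OpenCV
// orders the distortion coefficients as (k1, k2, p1, p2, k3).
void removeDistortion(const cv::Mat& src, cv::Mat& dst,
                      const cv::Mat& K, const cv::Mat& distCoeffs)
{
    cv::undistort(src, dst, K, distCoeffs);
}
```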
In an ideal binocular stereo vision system, the left and right camera imaging planes are coplanar and row-aligned. In practice, however, the two planes deviate from this ideal, and the purpose of stereo correction is to correct this deviation. The stereo correction process is shown in Figure 12.
The rotation matrix R3×3 and translation vector T3×1 obtained during camera calibration represent the transformation from the right camera coordinate system to the left camera coordinate system. In order to minimize the image reprojection distortion, the imaging planes of the left and right cameras are each rotated by half of R3×3. The rotation matrices of the left and right cameras are represented by rl and rr, respectively, and are related as:

$$r_l = R_{3\times3}^{1/2}, \qquad r_r = R_{3\times3}^{-1/2}$$
After this first rotation, the imaging planes of the left and right cameras are coplanar (parallel), but the rows are not yet aligned. In order to achieve row alignment of the left and right images, a new rotation matrix Rrect is constructed as:

$$R_{rect} = \begin{bmatrix} e_1^{T} \\ e_2^{T} \\ e_3^{T} \end{bmatrix}$$
T in Figure 12, obtained by camera calibration, is the translation vector from the origin of the right camera coordinate system to the origin of the left camera coordinate system; the two origins are already aligned along the direction of the vector T. Therefore, the rotation matrix Rrect is constructed from T, and the first vector e1 is the unit vector along the baseline:

$$e_1 = \frac{T}{\lVert T \rVert}$$

The second vector e2 is orthogonal to e1, so e2 can be obtained by computing the cross product of the unit vector along the main optical axis with e1. After normalization, e2 is expressed as follows:

$$e_2 = \frac{1}{\sqrt{T_x^2 + T_y^2}} \begin{bmatrix} -T_y & T_x & 0 \end{bmatrix}^{T}$$
The third vector e3 is orthogonal to both e1 and e2:

$$e_3 = e_1 \times e_2$$
Combining the two steps of stereo correction, the overall rectification rotation matrices of the left and right cameras are obtained as:

$$R_l = R_{rect}\, r_l, \qquad R_r = R_{rect}\, r_r$$
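The construction of Rrect from the calibrated T can be sketched in C++/OpenCV as follows; the function name and types are our assumptions:

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Build R_rect from the calibrated translation vector T: e1 along the
// baseline, e2 orthogonal to e1 in the image plane, e3 = e1 x e2.
cv::Matx33d buildRrect(const cv::Vec3d& T)
{
    cv::Vec3d e1 = T / cv::norm(T);                       // unit baseline vector
    cv::Vec3d e2(-T[1], T[0], 0.0);                       // (0,0,1) x e1, unnormalized
    e2 /= std::sqrt(T[0] * T[0] + T[1] * T[1]);           // normalize
    cv::Vec3d e3 = e1.cross(e2);                          // completes the basis
    return cv::Matx33d(e1[0], e1[1], e1[2],
                       e2[0], e2[1], e2[2],
                       e3[0], e3[1], e3[2]);
}
```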
After correcting the distortion and performing stereo correction, the reprojection matrix relating the pixel coordinate system and the world coordinate system is obtained as follows:

$$Q = \begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 0 & f \\ 0 & 0 & -1/b & 0 \end{bmatrix}$$

where cx and cy represent the x- and y-axis coordinates of the principal point of the corrected left camera image, respectively, f is the focal length of the camera, and b is the baseline length.
Figure 13 presents the effect of stereo correction. Row alignment is achieved between the left and right images after rectification, although the cropping operation performed after rectification removes part of the image content. Table 2 shows the camera parameters after correction.
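In an OpenCV implementation, both correction steps are typically carried out with cv::stereoRectify, which also returns the reprojection matrix Q described above. The sketch below reflects the paper's C++/OpenCV environment; the function name and the alpha = 0 cropping choice (which explains the lost image content noted for Figure 13) are illustrative assumptions:

```cpp
#include <opencv2/opencv.hpp>

// Rectify a calibrated stereo pair: stereoRectify computes the rotation
// matrices R1/R2, new projection matrices P1/P2, and the reprojection
// matrix Q; remap then row-aligns both views.
void rectifyPair(const cv::Mat& K1, const cv::Mat& D1,
                 const cv::Mat& K2, const cv::Mat& D2,
                 const cv::Mat& R, const cv::Mat& T, cv::Size size,
                 const cv::Mat& left, const cv::Mat& right,
                 cv::Mat& leftRect, cv::Mat& rightRect, cv::Mat& Q)
{
    cv::Mat R1, R2, P1, P2;
    cv::stereoRectify(K1, D1, K2, D2, size, R, T, R1, R2, P1, P2, Q,
                      cv::CALIB_ZERO_DISPARITY, 0.0 /* alpha=0: crop to valid pixels */);
    cv::Mat map1x, map1y, map2x, map2y;
    cv::initUndistortRectifyMap(K1, D1, R1, P1, size, CV_32FC1, map1x, map1y);
    cv::initUndistortRectifyMap(K2, D2, R2, P2, size, CV_32FC1, map2x, map2y);
    cv::remap(left,  leftRect,  map1x, map1y, cv::INTER_LINEAR);
    cv::remap(right, rightRect, map2x, map2y, cv::INTER_LINEAR);
}
```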
4.4. Stereo Matching
Stereo matching is a key link in binocular stereo vision. Its purpose is to match the corresponding pixels in two or more views and calculate the disparity, which is estimated by establishing and minimizing an energy cost function. The semi-global block matching (SGBM) algorithm is a stereo matching algorithm based on the semi-global matching (SGM) [41] algorithm. Compared with traditional global and local matching algorithms, SGBM offers a better balance of accuracy and computational speed. Stereo matching comprises four steps: cost computation, cost aggregation, disparity computation, and post-processing.
The SGBM algorithm combines the SAD [42,43] and BT [44] algorithms for computing the matching cost, thus increasing its robustness. The BT cost is composed of two parts: the cost computed on the image convolved with the Sobel operator in the x direction (Sobelx) and the cost computed on the original image. The vertical edge information of the image convolved with the Sobelx operator is more pronounced, so calculating the BT cost after this convolution optimizes matching in edge areas, while directly calculating the BT cost of the original image preserves the overall image information. Combining the two BT costs enhances the matching effect at edges and details while retaining a large amount of the original image information. After the two types of BT cost are added, a SAD block calculation is performed, i.e., the cost value of each pixel is replaced by the sum of the cost values in its neighborhood. Combining the two processes yields:

$$C(p, d) = \sum_{q \in N_p} \left[ C_{BT}\big(q, d;\, I^{Sobel_x}\big) + C_{BT}\big(q, d;\, I\big) \right]$$

where p and q denote pixels of the left camera view, Np represents the neighborhood of p, (u, v) represents the coordinates of a pixel in the left camera image, Il and Ir represent the gray values of the left and right camera image pixels, respectively, $\bar{I}_l$ and $\bar{I}_r$ represent the gray values at sub-pixel positions of the left and right camera images, respectively, and d represents the disparity. Here, the BT dissimilarity $C_{BT}$ compares $I_l(u, v)$ with the sub-pixel interpolated values $\bar{I}_r$ in a half-pixel neighborhood of $(u - d, v)$ (and, symmetrically, $\bar{I}_l$ with $I_r$) and takes the minimum absolute difference.
In the aforementioned process, for every pixel block in the left image, the matching window in the right image moves within a specified retrieval range, i.e., the value of d varies within a certain limit. The costs corresponding to different values of d are recorded for cost aggregation. The energy cost function of cost aggregation is shown in (20); the goal of stereo matching is to minimize it.

$$E(D) = \sum_{p} \left( C(p, D_p) + \sum_{q \in N_p} P_1\, T\big[\,|D_p - D_q| = 1\,\big] + \sum_{q \in N_p} P_2\, T\big[\,|D_p - D_q| > 1\,\big] \right) \tag{20}$$
where p and q represent pixel points, Np represents the neighborhood of p, Dp and Dq represent the disparity values of p and q, respectively, the function T[...] equals 1 when the condition in the brackets is satisfied and 0 otherwise, and P1 and P2 are penalty coefficients with P1 < P2. The smaller coefficient allows the algorithm to adapt to situations where the disparity changes slightly, such as pixels on an inclined plane or a continuous surface, whereas the larger penalty coefficient allows it to adapt to situations where the disparity changes significantly, such as pixels at object edges.
In order to find the minimum of the energy function efficiently, the SGBM algorithm uses dynamic programming to optimize the cost aggregation. For a single pixel p, the cost along a path r is computed as follows:

$$L_r(p, d) = C(p, d) + \min\Big( L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_{i} L_r(p - r, i) + P_2 \Big) - \min_{k} L_r(p - r, k)$$

where the second term compares the matching costs of the previous point on the path under different disparities and combines the penalty coefficients P1 and P2 to select the minimum. This avoids a global cost aggregation operation and greatly reduces the amount of computation.
After cost calculation, the disparity value associated with the smallest matching cost of each pixel is selected based on the winner takes all (WTA) principle.
The SGBM algorithm obtains the disparity map after completing the above process. In order to make the disparity more accurate, post-processing is required, which mainly includes a uniqueness check, sub-pixel interpolation, and a left-right check. The uniqueness check eliminates erroneous disparity values to avoid local optima, the sub-pixel interpolation makes the disparity smoother, and the left-right check addresses matching errors caused by occlusion. After post-processing, the final disparity map is obtained.
Figure 14 presents a comparison diagram of the disparity map and the original image.
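A hedged sketch of how the SGBM stage might be configured with OpenCV's StereoSGBM is given below. The parameter values are illustrative rather than the ones used in this work; P1 and P2 follow the common 8·c·w² / 32·c·w² heuristic for window size w and channel count c, consistent with the requirement P1 < P2 above:

```cpp
#include <opencv2/opencv.hpp>

// Compute a floating-point disparity map from a rectified pair with SGBM.
cv::Mat computeDisparity(const cv::Mat& leftRect, const cv::Mat& rightRect)
{
    const int blockSize = 7, channels = leftRect.channels();
    auto sgbm = cv::StereoSGBM::create(
        /*minDisparity*/     0,
        /*numDisparities*/   128,             // must be divisible by 16
        blockSize,
        /*P1*/               8 * channels * blockSize * blockSize,
        /*P2*/               32 * channels * blockSize * blockSize,
        /*disp12MaxDiff*/    1,               // left-right check tolerance
        /*preFilterCap*/     63,              // cap on Sobel-filtered values
        /*uniquenessRatio*/  10,              // uniqueness check
        /*speckleWindowSize*/100,
        /*speckleRange*/     2);
    cv::Mat disp16, disp;
    sgbm->compute(leftRect, rightRect, disp16);
    disp16.convertTo(disp, CV_32F, 1.0 / 16.0);  // SGBM returns fixed-point disparity
    return disp;
}
```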
4.5. Plane Fitting of Petroleum Pipeline Interface
After obtaining the petroleum pipeline interface's edge contour and the disparity map, the disparity values of the corresponding points are read at the pixel coordinates of the contour, and a set of 3D points is obtained using (4). Since the pipe has depth, the 3D coordinates of the interface's center read directly from the disparity map deviate from the interface plane. In order to correct this deviation, it is necessary to fit the plane of the petroleum pipeline interface.
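A minimal sketch of this step in the paper's C++/OpenCV setting: the disparity is read at each contour pixel and converted with Equation (4). The function name, and the assumption that the focal length is expressed in pixels so that (4) applies directly to pixel coordinates relative to the principal point (cx, cy), are ours:

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Collect the 3D contour points to be fitted. `contour` comes from the
// identification stage; invalid (non-positive) disparities are skipped.
std::vector<cv::Point3d> contourTo3D(const std::vector<cv::Point>& contour,
                                     const cv::Mat& disp32f,  // CV_32F disparity
                                     double fPix, double b, double cx, double cy)
{
    std::vector<cv::Point3d> pts;
    for (const cv::Point& p : contour) {
        double d = disp32f.at<float>(p);      // disparity in pixels
        if (d <= 0.0) continue;               // hole / mismatch: skip
        pts.emplace_back(b * (p.x - cx) / d,  // Xc, Eq. (4)
                         b * (p.y - cy) / d,  // Yc
                         fPix * b / d);       // Zc
    }
    return pts;
}
```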
The random sample consensus (RANSAC) algorithm can estimate the optimal model from a dataset containing abnormal data. It repeatedly extracts a small number of samples from the dataset and computes a parameterized model that satisfies them. A fault tolerance range is usually set during verification: when a data sample deviates from the candidate model by less than this tolerance, the sample is considered to conform to the model. RANSAC divides the data points into inliers, the samples that fit the parameterized model, and outliers, the samples that do not. The algorithm is iterative: whenever a better model is found, the model parameters are updated. A threshold can be set to reduce the number of computations; when the proportion of inliers reaches this threshold, the iteration terminates early.
The model to be fitted in this work is a space plane. The general expression of a space plane is:

$$A X + B Y + C Z + D = 0$$

The normal vector of this plane is $\vec{n} = (A, B, C)$.
Three non-collinear points define a plane, so it suffices to select three non-collinear points from the 3D point set to compute the parameters of the space plane expression. Assuming the three non-collinear points selected from the 3D point set are P1, P2, and P3, two vectors with the same starting point, $\vec{P_1 P_2}$ and $\vec{P_1 P_3}$, can be formed from their coordinates. The normal vector of the plane through the three points is the cross product of these two vectors:

$$\vec{n} = \vec{P_1 P_2} \times \vec{P_1 P_3} = (A, B, C)$$

The parameter D can then be obtained by substituting any one of the three points into the plane equation.
After obtaining the space plane equation, the accuracy of the model is verified. The distance from each point to the plane is calculated; if it is less than the preset fault tolerance, the point is accepted as an inlier, otherwise it is marked as an outlier. The proportion of inliers is then recorded. If a new model has a larger proportion of inliers, it is a better model for the sample, and the parameters of the space plane equation are updated. The iteration stops, and the parameters of the best model are recorded, when the number of iterations reaches its upper limit or the proportion of inliers reaches the set threshold.
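The procedure described above can be sketched as follows; the tolerance, iteration limit, early-exit ratio, and function name are illustrative assumptions:

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <random>
#include <vector>

// RANSAC plane fit over a 3D point set (assumes pts.size() >= 3).
// Returns (A, B, C, D) with a unit normal vector.
cv::Vec4d fitPlaneRansac(const std::vector<cv::Point3d>& pts,
                         double tol = 2.0, int maxIter = 500, double stopRatio = 0.9)
{
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, pts.size() - 1);
    cv::Vec4d best(0, 0, 0, 0);
    size_t bestInliers = 0;

    for (int it = 0; it < maxIter; ++it) {
        const cv::Point3d &p1 = pts[pick(rng)], &p2 = pts[pick(rng)], &p3 = pts[pick(rng)];
        cv::Point3d n = (p2 - p1).cross(p3 - p1);   // normal = P1P2 x P1P3
        double len = std::sqrt(n.dot(n));
        if (len < 1e-9) continue;                   // collinear or duplicate sample
        n *= 1.0 / len;
        double D = -n.dot(p1);                      // from A*x + B*y + C*z + D = 0

        size_t inliers = 0;
        for (const auto& q : pts)
            if (std::abs(n.dot(q) + D) < tol)       // point-to-plane distance test
                ++inliers;
        if (inliers > bestInliers) {
            bestInliers = inliers;
            best = cv::Vec4d(n.x, n.y, n.z, D);
            if (double(inliers) / pts.size() >= stopRatio) break;  // early exit
        }
    }
    return best;
}
```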
The pixel coordinates of the center of the petroleum pipeline interface obtained during the identification stage are (ucenter, vcenter), and the corrected camera parameters are taken from Table 2. The image physical coordinates of the interface center, (xcenter, ycenter), are related to (ucenter, vcenter) as follows:

$$x_{center} = (u_{center} - c_x)\, dx, \qquad y_{center} = (v_{center} - c_y)\, dy$$
It is evident from (4) that the coordinates of the petroleum pipeline interface's center in the camera physical coordinate system and in the world coordinate system are proportional to each other:

$$X_{center} = \alpha\, x_{center}, \qquad Y_{center} = \alpha\, y_{center}, \qquad Z_{center} = \alpha f \tag{25}$$

where (Xcenter, Ycenter, Zcenter) are the coordinates of the center of the petroleum pipeline interface in the world coordinate system, (xcenter, ycenter) are its coordinates in the image physical coordinate system, f is the focal length of the left camera, and α is the proportionality coefficient.
The coordinates of the center of the petroleum pipeline interface in the camera coordinate system satisfy the optimal plane model. Therefore, the coefficient α can be obtained by substituting (25) into the plane equation:

$$\alpha = \frac{-D}{A\, x_{center} + B\, y_{center} + C f}$$
Finally, by substituting α in (25), the world coordinate system coordinates of the petroleum pipeline interface’s center are obtained.
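Combining these steps, a short sketch of the center reconstruction is given below; the function name and the use of the corrected principal point (cx, cy) from Table 2 are our assumptions:

```cpp
#include <opencv2/core.hpp>

// Project the 2D interface center onto the fitted plane: compute the image
// physical coordinates from (cx, cy) and pixel sizes (dx, dy), then scale
// by alpha obtained from the plane equation A*Xc + B*Yc + C*Zc + D = 0.
cv::Point3d centerOnPlane(double uc, double vc, double cx, double cy,
                          double dx, double dy, double f, const cv::Vec4d& plane)
{
    const double x = (uc - cx) * dx, y = (vc - cy) * dy;  // image physical coords
    const double A = plane[0], B = plane[1], C = plane[2], D = plane[3];
    const double alpha = -D / (A * x + B * y + C * f);    // proportionality coefficient
    return { alpha * x, alpha * y, alpha * f };           // (Xcenter, Ycenter, Zcenter)
}
```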
Figure 15 shows the fitted plane when the lens directly faces the petroleum pipeline interface. It is evident from the image that abnormal points do not interfere with the final model, since the RANSAC plane fitting algorithm relies on the valid data samples, which makes the method more robust.
5. Experimental Design and Analysis
The camera used in the experiment is a fixed-baseline binocular camera; its parameters are shown in Table 3. The camera is connected to the computer through a USB 3.0 data cable. The proposed algorithm is implemented in C++ and runs in the VS2017 environment configured with OpenCV. In order to comprehensively verify the accuracy of the binocular stereo vision pose detection system, this work verifies the accuracy of both distance measurement and angle measurement.
During the distance measurement accuracy experiment, the binocular camera is placed on a tripod, and a bookend is placed in front of the camera, as shown in Figure 16. The acquired images are input into the program for processing, and the depth values computed by the proposed method are compared with the lens-to-bookend distances measured by a laser rangefinder. In order to improve the reliability of the experimental results, 10 points on the bookend are considered in each picture, and the average of their depth values is used as the distance from the camera to the bookend. According to binocular stereo vision ranging theory, the ranging accuracy is influenced only by the accuracy of the disparity, so the measurement accuracy of the depth also reflects the measurement accuracy in the horizontal and vertical directions and can be considered to represent the overall measurement accuracy of the system. The final experimental results are shown in Table 4. Compared with the actual measured distance, the maximum error of the distance computed by the proposed method is at the centimeter level, and the relative error does not exceed 1%. This error is completely acceptable for large industrial equipment. Therefore, it is feasible to apply binocular stereo vision to the intelligent docking of the manipulator arm.
The experimental platform for angle accuracy verification is shown in Figure 17. A 3D-printed flange model is placed on a bracket whose inclination angle can be adjusted; the bracket is set to different angles, and images of the model are captured and used as the input of the proposed method. Figure 18 shows the experimental results for the pipeline interface at different inclination angles. The inclination angle of the petroleum pipeline interface plane obtained by the proposed method is compared with the inclination angle measured by a level gauge; the results are shown in Table 5. The absolute error of the angle measurement is within 1 degree. Although the accuracy of the stereo vision measurement meets the requirements of docking, there are holes and dark regions in the disparity map, caused by large weakly textured areas and uneven illumination between the left and right cameras; this effect is limited in the edge region. It can also be seen from the fitted plane that the smoothness of the 3D contour of the petroleum pipeline interface increases with the inclination angle. This is because the gray values on the surface of the petroleum pipeline interface become more distinctly distributed as the angle increases, which greatly reduces the probability of mismatching in the stereo matching stage. Therefore, further enhancing the matching accuracy is essential for making the system suitable for practical applications.