2.1. Resolution Model of a Stereo Vision System
A binocular stereo vision system typically comprises two identical monocular vision systems; its principle is illustrated in Figure 1. Each monocular vision system comprises an optical imaging system (c1 or c2) and an imaging detector (CCD1 or CCD2). The optical imaging system is not necessarily a camera; it can be any complex optical imaging system that is equivalent to a thin lens when aberrations are neglected. Any type of imaging detector can be used, provided that it can capture the image of the object. The two optical systems are assumed to be identical. The image distance is f; the center distance (baseline distance) of the two imaging planes is 2L; and o1 and o2 are the intersection points between the optical axes of the left and right monocular imaging systems and the corresponding imaging planes, respectively. The world coordinate system O-xyz is constructed with the midpoint of o1o2 as the origin and the extension line of o1o2 as the x-axis. The optical axis inclinations of the left and right monocular imaging systems are α1 and α2, respectively, where α1 = α2 = α. The imaging coordinate systems of the left and right monocular imaging systems are o1x1y1z1 and o2x2y2z2, with o1 and o2 as the respective origins, and the corresponding optical axes are o1z1 and o2z2. Point P is the point to be measured and lies in the FOV of the binocular stereo vision system. Its imaging points on the two imaging planes are P1 and P2, with coordinates (x1, y1, 0) and (x2, y2, 0), respectively. Point p is the vertical projection of point P onto the c1c2o1o2 plane; hence, point P has the same x- and z-coordinates as point p. Through point p, a straight line parallel to o1o2 in the c1c2o1o2 plane intersects o1z1 and o2z2 at points p1 and p2, respectively. A perpendicular line through point p intersects o1z1 at point b.
Before analyzing the factors affecting the resolution of the visual system, resolution must be defined. We define the resolution of a binocular stereo vision system as the minimum displacement of an object that can be effectively recognized by the visual system, i.e., the displacement of the object that produces a perceptible change on the imaging plane of the left or right monocular system.
As shown in Figure 1, the coordinates of object point P (x, y, z) in the imaging coordinate system o1x1y1z1 of the left visual system are (s1 × cos α1, y, z × sec α1 + s1 × sin α1), and the coordinates of object point P in the imaging coordinate system o2x2y2z2 of the right visual system are (s2 × cos α2, y, z × sec α2 − s2 × sin α2), where:
Based on the similar triangle principle, we obtain:
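The numbered equations are not reproduced in this excerpt. A reading consistent with the camera coordinates stated above (our reconstruction: the world coordinates of P, taken relative to each lens center, rotated by the inclination angle about the y-axis) is

$$ s_1 = (x + L) - z\tan\alpha_1, \qquad s_2 = (x - L) + z\tan\alpha_2 \quad (\text{cf. Equation (1)}), $$

with the similar-triangle (pinhole) projection onto each image plane

$$ x_1 = \frac{f\,s_1\cos\alpha_1}{z\sec\alpha_1 + s_1\sin\alpha_1}, \quad y_1 = \frac{f\,y}{z\sec\alpha_1 + s_1\sin\alpha_1}, \quad x_2 = \frac{f\,s_2\cos\alpha_2}{z\sec\alpha_2 - s_2\sin\alpha_2}, \quad y_2 = \frac{f\,y}{z\sec\alpha_2 - s_2\sin\alpha_2} \quad (\text{cf. Equation (2)}). $$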
Based on the definition of the resolution of the visual system, when point P moves along any of the three directions (x, y, or z), the coordinates of its image points in the left and right cameras change accordingly. If the coordinate change of either image point can be recognized, the vision system can distinguish the change in the position of point P; the minimum such displacement is the resolution. The resolutions of the visual system in the three directions are independent of each other; therefore, the resolution can be calculated separately in each direction. The resolutions in the x-, y-, and z-directions are denoted as ΔX, ΔY, and ΔZ, respectively.
Suppose that point P moves only slightly along the x-direction. In this case, to solve for the displacements of the image points in the left and right visual systems, the coordinates of point P should first be expressed in terms of the left and right imaging coordinates. By substituting Equation (1) into Equation (2), we obtain:
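Equations (3)–(6) are likewise not reproduced here. Under the reconstruction sketched above, substituting the expressions for s1 and s2 into the projection relations gives (our derivation, stated for reference)

$$ x_1 = f\,\frac{(x+L)\cos\alpha_1 - z\sin\alpha_1}{(x+L)\sin\alpha_1 + z\cos\alpha_1}, \qquad y_1 = \frac{f\,y}{(x+L)\sin\alpha_1 + z\cos\alpha_1}, $$
$$ x_2 = f\,\frac{(x-L)\cos\alpha_2 + z\sin\alpha_2}{z\cos\alpha_2 - (x-L)\sin\alpha_2}, \qquad y_2 = \frac{f\,y}{z\cos\alpha_2 - (x-L)\sin\alpha_2}, $$

which can be inverted to express x, y, and z in terms of the image coordinates before differentiating.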
When P moves only along the x-direction and does not change along the y- and z-directions, the derivatives of Equations (3)–(6) with respect to x1, y1, x2, and y2 can be obtained as follows:
Assuming that the left and right visual systems are identical, let the minimum perceptible change on the image planes of the left and right visual systems be Δw. Based on the definition of resolution, the resolution in the x-direction can be written as:
Here, Δw is one pixel when no subpixel algorithm is considered. However, with the development of subpixel algorithms, Δw can be one-tenth of a pixel or even smaller, where w denotes the sampling interval.
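The x-direction resolution equation itself is not reproduced in this excerpt; schematically (our paraphrase), it follows from the sensitivity of the image coordinates to the object motion,

$$ \Delta X = \frac{\Delta w}{\max\!\left(\left|\frac{\partial x_1}{\partial x}\right|,\ \left|\frac{\partial y_1}{\partial x}\right|,\ \left|\frac{\partial x_2}{\partial x}\right|,\ \left|\frac{\partial y_2}{\partial x}\right|\right)}, $$

so that the displacement is resolved as soon as either image point moves by the minimum perceptible amount Δw. For parallel axes (α1 = α2 = 0), ∂x1/∂x = f/z, giving ΔX = zΔw/f.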
Similarly, for the resolutions in the y- and z-directions, the imaging coordinate systems of the left and right visual systems are used to represent the y- and z-coordinates of point P, respectively, and the derivatives can be obtained as follows:
Subsequently, the resolutions in the y- and z-directions of the visual system are expressed as:
A 3D resolution model of the visual system has thus been developed. The theoretical resolution of any visual system can be obtained by combining the model with the FOV requirements. According to the model, the resolution of the visual system is related not only to its internal parameters but also directly to its external parameters. In application environments featuring a large FOV, the left and right visual systems are typically placed in parallel. Therefore, the resolutions of a visual system with parallel optical axes are expressed as follows:
Based on Equation (20), the following conclusions can be drawn intuitively:
1. The resolution of the parallel-optical-axis vision system is proportional to the sampling interval.
2. The resolutions in the x- and y-directions remain the same over the entire FOV.
3. The x- and y-resolutions are proportional to the test distance, whereas the z-resolution is proportional to the square of the test distance.
4. The z-direction resolution is inversely proportional to the baseline distance.
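Equation (20) itself is not reproduced in this excerpt. A standard parallel-axis form consistent with the four conclusions above and with the parameter study later in this section (our assumption, not a verbatim reproduction) is

$$ \Delta X = \Delta Y = \frac{z\,\Delta w}{f}, \qquad \Delta Z = \frac{z^{2}\,\Delta w}{L\,f}, $$

where 2L is the baseline distance and z is the test distance.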
The relationship between the resolution and the FOV of the visual system can be shown more intuitively with a numerical example. The focal length of the left and right lenses was assumed to be 16 mm; the sampling interval was 0.2 μm, i.e., Δw = 0.2 μm; the FOV of the visual system was 1 m; the baseline distance 2L was 2 m; and the test distance z was 2 m. Hence, the resolution of the monocular visual system in the x-direction was 24.8 μm, and the resolution in the y-direction was also 24.8 μm. The z-direction resolutions of the monocular and binocular vision systems are shown in Figure 2. As shown in Figure 2d, the binocular vision system with parallel optical axes exhibited the worst resolution (49.2 μm) at the center of the FOV and the best resolution (32.8 μm) at the edge of the FOV. In contrast, as shown in Figure 2c, the z-direction resolution of the monocular system was monotonic: taking the left visual system as an example, the greater the distance from its optical axis, the higher the z-direction resolution.
Figure 2a shows the change along the x1-direction on the image plane of the left visual system when point P moved along the z-direction, which is consistent with Figure 2c. Figure 2b shows the change along the y1-direction on the image plane of the left visual system when point P moved along the z-direction. When point P was at the center of the FOV, a displacement of point P in the z-direction produced no change in the y1-direction. At the edge of the FOV, the minimum displacement of point P along the z-direction that could be recognized from the y1-direction change was 98.4 μm. In other words, when point P moved along the z-direction, its image point was displaced along both the x1- and y1-directions on the image plane, and the displacement along the x1-direction exceeded that along the y1-direction.
To analyze the influence of the other parameters on the resolution, we first fix the parameters of the visual system as follows: the measurement distance is 2.5 m, the FOV is 1.5 m, the focal length is 16 mm, the sampling interval is 0.5 μm, and the size of the detector is assumed to be unlimited. The variable parameters are the optical axis inclination and the baseline distance of the visual system. First, assume that the optical axis inclination is 0° and that the baseline distance varies from 1.5 m to 2.5 m. According to the resolution model, the resolution in the x- and y-directions is then 78.125 μm, and the resolution in the z-direction is shown in Figure 3. Evidently, the larger the baseline distance, the better the z-direction resolution: as the baseline distance increased from 1.5 m to 2.5 m, the z-direction resolution improved from 260.4 μm to 156.3 μm. In theory, the resolution can thus be improved by increasing the baseline distance; in practical applications, however, increasing the baseline distance inevitably reduces the FOV, so the resolution cannot be greatly improved by this method. Next, assume that the baseline distance is 2 m and that the optical axis inclination increases from 0° to 10°. The resolutions of the visual system in the x-, y-, and z-directions can then be calculated according to the resolution model, as shown in Figure 4. As can be seen from the figure, the resolution degrades as the optical axis inclination increases: as the inclination increased from 0° to 10°, the resolution in the x-direction degraded from 78.125 μm to 86.81 μm, that in the y-direction from 78.125 μm to 82.36 μm, and that in the z-direction from 195.3 μm to 217.7 μm. Therefore, the simplest and most effective way to improve the resolution of a visual system is to reduce the sampling interval. There are two methods of reducing the sampling interval: reducing the pixel size of the detector, and developing a new subpixel algorithm to subdivide the pixel. At present, however, further breakthroughs are difficult with either method.
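The baseline sweep above can be reproduced with the parallel-axis form assumed earlier. The following minimal sketch uses our own variable names; the inclination dependence requires the full model and is not included:

```python
# Minimal numerical check of the baseline sweep above, assuming the
# parallel-axis model dX = dY = z*dw/f and dZ = z**2*dw/(L*f), where
# L is the half-baseline (our reconstruction of Equation (20)).

def parallel_axis_resolution(z_mm, f_mm, dw_mm, half_baseline_mm):
    """Return (dX, dY, dZ) in mm for a parallel-axis binocular system."""
    dx = z_mm * dw_mm / f_mm                            # constant over the FOV
    dz = z_mm ** 2 * dw_mm / (half_baseline_mm * f_mm)  # depth resolution
    return dx, dx, dz

if __name__ == "__main__":
    # Parameters of the study above: z = 2.5 m, f = 16 mm, dw = 0.5 um.
    for two_l_m in (1.5, 2.0, 2.5):
        dx, dy, dz = parallel_axis_resolution(2500.0, 16.0, 0.5e-3, two_l_m * 1000 / 2)
        print(f"2L = {two_l_m:.1f} m: dX = dY = {dx * 1000:.3f} um, dZ = {dz * 1000:.2f} um")
    # Expected: dX = dY = 78.125 um throughout; dZ = 260.42, 195.31, and
    # 156.25 um, matching the 260.4, 195.3, and 156.3 um quoted above.
```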
To demonstrate the predictive capability of the 3D resolution model, we constructed a vision system. The focal length of the lens (product model: HN-0816-5M-C2/3X) is 8 mm. The imaging area of the imaging detector (product model: MER2-503-36U3M) is 2/3 inch, and the pixel size is 3.45 μm. The parameters of the displacement platform are as follows: the maximum stroke is 200 μm and the resolution is 20 nm. We first calibrated the FOV of the visual system, as shown in Figure 5. The FOV in the x-direction is approximately 315 mm, and the FOV in the y-direction is assumed to also be 315 mm. The measurement distance can then be calculated as 315 × 8/8.8 ≈ 286 mm. Owing to the limited stroke of the displacement platform, we placed the target at the edge of the FOV of the vision system, as shown in Figure 6; the distance of the object from the optical axis was then 96 mm. Assuming that the minimum displacement recognizable by the camera is 0.5 pixels, i.e., 1.725 μm, the theoretical resolutions calculated by the resolution model are 61.7 μm, 61.7 μm, and 183.7 μm in the x-, y-, and z-directions, respectively. Owing to the limited stroke of the displacement platform, the platform was moved in a reciprocating manner in this experiment: it first moved a μm along the x-direction and then moved −a μm along the x-direction, and this cycle was repeated 5 times; the platform was moved in the same way in the y- and z-directions. In the actual test, the distance shifted in the x- and y-directions was 62 μm, i.e., a = 62 μm, and the distance shifted in the z-direction was 184.1 μm, i.e., a = 184.1 μm. The displacement results calculated by the image-matching algorithm based on ZNCC are shown in Table 1. As can be seen from the table, these displacements were effectively identified; that is, the measured three-dimensional displacement resolution is (62 μm, 62 μm, 184.1 μm). The measured resolution is very close to the theoretical values and only slightly larger, mainly owing to parameter errors, noise, and other factors. Therefore, our model can be effectively applied to a traditional vision system to predict its resolution.
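The quoted theoretical values can be checked with the same assumed model. In this sketch, the monocular depth formula ΔZ = d²Δw/(f·x_off), with x_off the lateral distance of the target from the optical axis, is our reading of the z-resolution for this configuration; it reproduces the values above:

```python
# Sanity check of the experiment's theoretical resolutions, assuming
# dX = dY = d*dw/f and, for a target a distance x_off from the optical
# axis of a single camera, dZ = d**2*dw/(f*x_off) (our reading).

def experiment_resolutions(d_mm, f_mm, dw_mm, x_off_mm):
    """Return (dX, dY, dZ) in mm for the single-camera test geometry."""
    dx = d_mm * dw_mm / f_mm
    dz = d_mm ** 2 * dw_mm / (f_mm * x_off_mm)
    return dx, dx, dz

if __name__ == "__main__":
    # d = 286 mm, f = 8 mm, dw = 0.5 px = 1.725 um, x_off = 96 mm.
    dx, dy, dz = experiment_resolutions(286.0, 8.0, 1.725e-3, 96.0)
    print(f"dX = dY = {dx * 1000:.1f} um, dZ = {dz * 1000:.1f} um")
    # -> 61.7 um and 183.7 um, matching the theoretical values above.
```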
2.2. Principle of Visual Measurement Technology Based on Confocal Scanning Imaging
Based on the analysis presented in the previous section, the resolution of the visual system depends on the sampling interval; that is, the smaller the sampling interval, the higher the resolution of the visual system. However, conventional vision systems use fixed detectors such as CCDs as imaging devices, and the pixel size is the sampling interval. Owing to factors such as processing technology and detector materials, further reducing the pixel size of detectors is challenging; consequently, the resolution of a conventional vision system cannot be easily improved. As shown in the section above, under a meter-level FOV, the resolution of the binocular system typically exceeds 10 µm. Nevertheless, vision technology has a simple structure, the 3D displacement of an object can be calculated by image matching alone, and the measurement FOV is relatively large.
Confocal imaging systems are typically used for microscopic measurements. These systems scan objects point by point and then perform single-point detection imaging for each scanning point. Therefore, the sampling interval of this technology depends on the scanning interval, which can be adjusted through the scanning frequency of the galvanometer and the sampling frequency of the data acquisition card. For existing data acquisition cards, the sampling rate can typically reach tens of megahertz, and a sampling interval at the micron or submicron level can be achieved under a meter-level FOV by matching the scanning frequency. Furthermore, confocal technology can achieve nanometer or even subnanometer resolution; however, its maximum FOV is on the order of millimeters.
In more detail, the confocal technique is a point-by-point scanning imaging technique: it illuminates an object point with a point light source and then collects the light reflected from that point to image it. In the original confocal technique, the optical system is stationary while the object is moved in the plane by an x/y displacement platform to achieve scanning. In this configuration, the three-dimensional information of the object, and hence its axial displacement, cannot be obtained. To obtain three-dimensional information, an axial displacement platform must drive the object along the axial direction, after which the x/y displacement platform moves the object in the plane and the object is imaged; these steps are then repeated. In general, the number of axial movements is several dozen or even several hundred, and the three-dimensional information of the object is then calculated by an appropriate algorithm.
Moreover, confocal technology uses a point light source that illuminates only one object point at a time. As the object moves in the xy-plane, the object point illuminated by the point light source changes. The confocal system collects the reflected signals of these different object points through the data acquisition card, and because the sampling rate of the card is very high, the interval between two adjacent object points is very small. Therefore, when an object point moves in the xy-plane, the movement can easily be identified by confocal technology. When the object undergoes an axial displacement, however, the confocal technology must obtain the three-dimensional information of the object before and after the displacement and then calculate the axial displacement from the relative change in this information. This method is therefore very slow. With the development of the technology, it is now common to use galvanometers to realize the two-dimensional scanning (the two scanning methods are otherwise identical), but axial scanning still requires a displacement platform.
Although visual and confocal measurement technologies are unrelated, we combined them to propose a vision measurement technology based on confocal scanning. Our technology combines the advantages of the two: confocal technology is used to scan the object to improve the resolution, and vision technology is then used to calculate the three-dimensional displacement of the object without axial scanning. A schematic illustration of the monocular system is shown in Figure 7. As shown in the schematic, the technology is primarily composed of two components: a photographic lens, which serves to increase the measurement range of the system, and a confocal scanning imaging subsystem, which is composed of elements 3–12 in the figure. The confocal scanning imaging module replaces the CCD imaging module of the conventional vision system, and a point-by-point scanning imaging method is used to realize the imaging measurement of the object. In the schematic, the imaging pixels are mapped to the object side (equivalently, the object is mapped to the image side). When the object point shifts slightly, the amount of movement is smaller than the size corresponding to a CCD pixel; therefore, in the conventional visual system, the object point is imaged in the same pixel before and after the displacement, and even though the object point is displaced, the conventional visual system cannot recognize it. For the vision technology based on confocal scanning imaging, the sampling interval can be reduced by adjusting the sampling rate and scanning frequency, and submicron or smaller sampling intervals can be realized under a meter-level FOV. Therefore, for the confocal vision system in the schematic, because the sampling interval is reduced, the object point is imaged at the 7th sampling interval (the first black pixel from left to right) before the shift and at the 10th sampling interval after the shift. This implies that a slight displacement can be recognized.
We can treat the entire confocal module as an imaging detector whose role is to scan the image plane of the photographic lens. This is feasible because diffraction is not considered when the resolution model is established; that is, the aperture of the photographic lens is treated as infinite. Therefore, imaging of the object by the photographic lens does not lose any information about the object but only changes its size, and the scanning of the confocal module over the image plane is equivalent to a single-pixel detector scanning the image plane. Assume that the focal lengths of the objective lens, tube lens, and scanning lens are f2, f3, and f4, respectively; the scanning angle range of the galvanometer is θ; the scanning frequency of the galvanometer is m Hz; and the sampling frequency of the data acquisition card is n Hz. According to the confocal scanning principle, the angular sampling interval is m/n × θ. Then, the minimum sampling interval on the front focal plane of the scanning lens is f4 × m/n × θ, and the sampling interval on the front image plane of the objective lens is:
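(f4 × m × θ)/(n × k),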
where k represents the combined magnification of the objective lens and the tube lens, which is called the first-order magnification. When k = 1, the objective and tube lenses can be removed. Therefore, the 3D resolution model of the vision measurement technology based on confocal scanning is expressed as follows:
where d represents the test distance. The following can be seen from the formula:
1. The resolution of the vision measurement technology based on confocal scanning is proportional to the scanning frequency; that is, the lower the scanning frequency, the higher the resolution of the system.
2. The resolution is inversely proportional to the sampling frequency; that is, the higher the sampling frequency, the higher the resolution of the system.
3. The resolution is proportional to the focal length of the scanning lens; that is, the smaller the focal length, the higher the resolution.
4. The resolution is inversely proportional to the first-order magnification; that is, the greater the first-order magnification, the higher the resolution.
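A minimal numerical sketch of this model, under the same parallel-axis assumptions used in Section 2.1 and with the matching-algorithm factor of 0.1 sampling interval used in the example below (our reconstruction, not the paper's verbatim equations):

```python
# Minimal sketch of the confocal-scanning resolution model: the sampling
# interval of the conventional system is replaced by the confocal scanning
# interval f4*m*theta/(n*k), optionally refined by the matching algorithm.
# The parallel-axis mapping dX = d*dw/f is assumed, as in Section 2.1.
import math

def confocal_sampling_interval_mm(f4_mm, theta_deg, m_hz, n_hz, k):
    """Scanning sampling interval mapped onto the objective's image plane."""
    return f4_mm * math.radians(theta_deg) * m_hz / (n_hz * k)

if __name__ == "__main__":
    # Parameters of the example below: f4 = 367 mm, theta = 25 deg,
    # m = 20 Hz, n = 20 MHz, k = 5; photographic lens f = 16 mm, d = 2 m;
    # the matching algorithm resolves 0.1 of a sampling interval.
    dw = 0.1 * confocal_sampling_interval_mm(367.0, 25.0, 20.0, 20e6, 5.0)
    d_mm, f_mm = 2000.0, 16.0
    dx = d_mm * dw / f_mm
    print(f"effective sampling interval = {dw * 1e6:.1f} nm, dX = dY = {dx * 1e3:.2f} um")
    # -> about 3.2 nm and 0.40 um, consistent with the values quoted below.
```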
Assume that the focal length of the photographic lens is 16 mm; the image-matching algorithm can resolve 0.1 of a sampling interval; the FOV of the visual system is 1.5 m; the baseline distance is 2 m; the test distance d is 2 m; the scanning angle of the galvanometer is 25°; the first-order magnification is 5; the focal length of the scanning lens is 367 mm; the sampling frequency of the data acquisition card is 20 MHz; and the scanning frequency is 20 Hz. The ratio between the sampling frequency of the data acquisition card and the scanning frequency of the galvanometer is called the scanning sampling ratio, i.e., r = n/m. Through calculation, the resolutions of the system in the x- and y-directions are both 0.4 μm. The z-direction resolution of the confocal binocular vision system is shown in Figure 8. Compared with the traditional vision system, the resolution of the vision measurement system based on confocal scanning is thus improved by more than 50 times under the same photographic lens. When the optical system remains unchanged, the resolution of the system improves as the scanning sampling ratio increases. Because the x- and y-direction resolutions of the system are constant over the full FOV, and the z-direction resolution is constant at the same x position, the relationship between the system resolution and the scanning sampling ratio is shown at y = 0 in Figure 9, where cs0.2 denotes the resolution of the traditional vision system with a sampling interval of 0.2 μm. As can be seen from the figure, as long as the scanning sampling ratio is greater than 1.6 × 10⁴, the resolution of the vision measurement system based on confocal scanning is theoretically superior to that of the traditional vision system. When the scanning sampling ratio is greater than 10⁵, the resolution of the system can reach better than 10 μm, and as the scanning sampling ratio increases further, the theoretical resolution of the system improves proportionally.
However, as can be seen from the schematic diagram, the function of the photographic lens is to ensure the FOV and the measurement distance. To ensure that the common FOV is not less than 1.5 m × 1.5 m, the diameter of the FOV of the photographic lens is designed to be not less than 3.5 m, which in turn leads to a relatively large image-plane diameter for the photographic lens. Therefore, if traditional optical design methods were used to design the entire optical part to fuse vision technology and confocal technology, the apertures of some optical lenses in the instrument would be too large. To solve this problem, we designed the lenses in the instrument as telecentric lenses, as shown in Figure 10. This effectively reduces the lens apertures, which can be limited to less than 180 mm through an appropriate design. In the design, it is necessary to ensure that the FOV and pupil are matched between the lenses so as to make full use of the performance of each lens.