1. Introduction
Accurate and timely yield estimation can have a significant effect on the profitability of vineyards, for example through better management of vineyard logistics, precise application of vine inputs, and the delineation of grape quality at harvest to optimise returns. Traditionally, yield estimation is conducted manually. However, this is destructive, labour-intensive, and time-consuming, leading to low sampling rates and subjective estimates [1]. Automating yield estimation is therefore the focus of ongoing research in the computer vision field [2].
Current 2D camera techniques predominantly rely on distinct features of grapes, such as colour or texture, to identify and count individual berries within RGB (Red, Green, and Blue) images [3,4]. However, the accuracy of yield estimates from these approaches is greatly restricted by the proportion of grapes visible to the camera, so occlusion of grapes is an issue. Additionally, errors in the sizing of grapes can occur unless the distance between the camera and the grapes is known.
An alternative technique, which has been reported to provide improved yield accuracy, is to incorporate 3D information. Grape bunch 3D architectonic modelling has been performed from high-resolution 3D scans of grape bunches within lab environments. These have been achieved using commercial laser scanners [5,6] and blue LED structured light scanners [7,8,9,10]. These scans can be used to estimate volume, mass, and the number of berries per bunch. However, these 3D scanners are costly, require significant time to capture viable point clouds, and their use is yet to be demonstrated within field environments.
High-resolution 3D scans of grapes and vines have also been achieved from multiple RGB images captured at different positions using structure-from-motion photogrammetry techniques [11,12,13]. This method can be used with inexpensive equipment [14], and data collection can be automated by mounting cameras on platforms such as robots or drones [15]. However, generating photogrammetry scans requires significant computational load and time. Rose et al. [12] quoted 8 h to generate a point cloud for one 25 m length of vine.
An alternative approach that has been investigated is to identify the location and size of individual berries within an RGB image of a bunch and use this information to model the 3D grape bunch architecture using spheres or ellipsoids. Liu et al. [16,17,18,19] used a backing board behind the grape bunch when capturing the RGB images to aid the segmentation of individual berries. Berry size was estimated by placing a chequerboard pattern on the board, which allowed the distance between the camera and the backing board to be measured using camera calibration techniques. However, the requirement for a backing board means this method can only be used for handheld applications. Ivorra et al. [28] developed a novel technique that utilised a stereoscopic RGB-D (Red, Green, Blue—Depth) camera to obtain berry size without having to use a chequerboard pattern. They combined the depth information with 2D image analysis to achieve 3D modelling of the grape bunches.
The potential real-time benefits of RGB-D cameras have encouraged researchers to investigate their use for grape yield estimation. A range of low-cost RGB-D cameras that can generate 3D scans in real time has become available on the market in recent years, driven by their use in a wide range of applications including gaming, robotics, and agriculture. The main technologies used are stereoscopy, Active Infrared Stereoscopy (AIRS), Structured Light (SL), Time of Flight (ToF), and Light Detection And Ranging (LiDAR). Stereoscopy is similar to human vision and uses the parallax and disparity between features in images from cameras that are spatially separated. Active infrared stereoscopy is similar but projects an Infrared (IR) pattern into the scene to assist with finding correspondences, which is particularly useful where objects being scanned have little visible texture and/or are in low light conditions. Structured light detects distortions in a known projected IR pattern. Time-of-flight and LiDAR cameras both operate by measuring the time taken for emitted IR light to be reflected back to the camera; ToF cameras typically emit this light in a single pulse, while LiDARs typically measure by sweeping a laser. RGB-D cameras have been used for 3D imaging of a range of different fruits [20], including several studies related to imaging grapes.
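To make the stereoscopic principle above concrete, the depth of a matched feature follows from its disparity as Z = f·B/d, where f is the focal length in pixels and B is the baseline between the two cameras. The following is a minimal sketch of this conversion; the focal length, baseline, and disparity values are illustrative assumptions, not parameters of any camera used in this study.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Convert a stereo disparity map (pixels) to depth (metres).

    Z = f * B / d, where f is the focal length in pixels, B is the baseline
    between the two cameras, and d is the disparity of a matched feature.
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(disparity_px, np.nan)
    valid = disparity_px > 0            # zero disparity -> no match / point at infinity
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical parameters, roughly in the range of consumer stereo RGB-D cameras.
disparity = np.array([[32.0, 30.5], [0.0, 28.0]])   # pixels
print(depth_from_disparity(disparity, focal_px=640.0, baseline_m=0.055))
```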
Marinello et al. [21] used a Kinect Version 1 (V1) camera, which operates using IR structured light, to image grapes in a lab environment for yield estimation. Their results showed that the scanning resolution decreased significantly as the distance between the sensor and the grapes increased. Hacking et al. [22,23] also used the Kinect V1 for yield estimation in both lab and vineyard environments. They showed that the Kinect V1 gave a good correlation with grape bunch volume in the lab but struggled in the field environment, which they suggested could be due to sunlight and the presence of leaves. They recommended that future work investigate the Kinect V2, since it is a ToF camera and hence more robust to sunlight than SL cameras such as the Kinect V1, which project IR patterns [24]. An alternative approach could be to take measurements at night; this technique has been used by studies capturing traditional RGB images in vineyards [3,25].
Kurtser et al. [26] used an Intel RealSense D435 RGB-D camera, which operates using AIRS technology, for imaging grape bunches in an outdoor environment. They used neural networks to detect grape bunches in the point clouds [27]. Basic shapes (box, ellipsoid, and cylinder) were fitted to the point clouds. However, they reported relatively large (28–35 mm) errors in the length and width of these fitted shapes compared with physical measurements of the grape bunches, and these errors were reported to be affected by sunlight exposure. It would appear that in sunlight the projected IR pattern was not viable, meaning this camera would have been acting as a passive stereo camera.
Ivorra et al. [28] used a stereoscopic RGB-D camera (Point Grey Bumblebee2) for imaging grapes, as mentioned above. However, the 3D scans of the grapes from this camera were of poor quality, which they suggested was due to difficulty in finding correct correspondences between the stereo image pairs. Yin et al. [29] also used a stereoscopic camera (ZED) for imaging grapes; however, it was used to measure the pose of grape bunches for automated robotic picking rather than for yield estimation.
This article presents the first benchmarking of the performance of multiple RGB-D cameras for grape yield estimation applications. This includes ToF cameras, which have not been used before in a grape yield estimation study. The benchmarking analysis was performed by calculating error maps between high-resolution scans obtained using photogrammetry and those obtained by the RGB-D cameras, and includes an analysis of the cameras' performance in and out of direct sunlight.
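As an illustration of the kind of error-map comparison described here, the sketch below computes point-to-nearest-point distances between an RGB-D scan and a photogrammetry reference using Open3D, assuming both clouds are already in the same coordinate frame. The file names are hypothetical, and this is only an illustrative stand-in for the processing pipeline detailed in Section 2.

```python
import numpy as np
import open3d as o3d

# Hypothetical file names; both clouds are assumed to share one coordinate frame.
reference = o3d.io.read_point_cloud("photogrammetry_bunch.ply")   # ground-truth scan
test = o3d.io.read_point_cloud("rgbd_bunch.ply")                  # RGB-D camera scan

# Distance from every RGB-D scan point to its nearest photogrammetry point.
errors = np.asarray(test.compute_point_cloud_distance(reference))

print(f"median error: {np.median(errors) * 1000:.1f} mm")
print(f"95th percentile: {np.percentile(errors, 95) * 1000:.1f} mm")

# Colour the RGB-D cloud by error to obtain a simple error map for visual inspection.
max_err = errors.max() if errors.size else 1.0
colours = np.c_[errors / max_err, np.zeros_like(errors), 1.0 - errors / max_err]
test.colors = o3d.utility.Vector3dVector(colours)
o3d.io.write_point_cloud("rgbd_error_map.ply", test)
```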
Previous studies [21,22,23,26,27,28] have only looked at volume errors for a grape bunch as a whole. In this work, depth map errors in the RGB-D scans of grapes are analysed at the scale of individual grape berries, which has not been done before.
The ability to identify individual grapes from 3D scans would provide additional information for the yield and crop load estimation process. This could inform viticulturists of metrics such as berry size distribution and berry count per cluster. There is also the potential for more accurate volume estimates through 3D modelling of the grape cluster architecture. This has been explored by several researchers [5,6,7,8,9,10,16,17,18,19,28] but not for RGB-D cameras, perhaps because these cameras have been thought to lack sufficient precision [5].
In this work, the ability of RGB-D cameras to detect individual grape berries using Random Sample Consensus (RANSAC) is investigated. We are not aware of any reported works that have applied an algorithm such as RANSAC to RGB-D camera scans for grape berry detection.
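For readers unfamiliar with the approach, the sketch below is a minimal, self-contained RANSAC sphere fit of the type that can be applied to berry detection: spheres are repeatedly fitted to random 4-point samples and the fit with the most inliers is kept. The inlier threshold and radius limits are illustrative assumptions rather than the values used in this study.

```python
import numpy as np

def fit_sphere(points):
    """Least-squares sphere through >= 4 points: x^2+y^2+z^2 + Dx + Ey + Fz + G = 0."""
    A = np.c_[points, np.ones(len(points))]
    b = -np.sum(points ** 2, axis=1)
    (D, E, F, G), *_ = np.linalg.lstsq(A, b, rcond=None)
    centre = -0.5 * np.array([D, E, F])
    radius = np.sqrt(max(centre @ centre - G, 0.0))
    return centre, radius

def ransac_sphere(points, n_iter=500, threshold=0.0015, r_range=(0.004, 0.012), seed=None):
    """Fit a sphere to an (N, 3) point array by RANSAC.

    threshold and r_range (metres) are illustrative values for grape-sized spheres.
    """
    rng = np.random.default_rng(seed)
    best = (None, None, np.array([], dtype=int))
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 4, replace=False)]
        centre, radius = fit_sphere(sample)
        if not (r_range[0] <= radius <= r_range[1]):
            continue                                   # reject implausible berry sizes
        dist = np.abs(np.linalg.norm(points - centre, axis=1) - radius)
        inliers = np.flatnonzero(dist < threshold)
        if len(inliers) > len(best[2]):
            best = (centre, radius, inliers)
    return best   # (centre, radius, inlier indices)
```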
The remainder of the article is organised as follows. Section 2 describes the experimental setup and data processing used. The results are presented in Section 3. Section 4 provides a discussion of the results. Finally, a conclusion is provided in Section 5.
4. Discussion
The RealSense D415, which uses AIRS technology, was the most accurate camera indoors. However, it showed reduced performance outdoors. This is in line with the findings of Kurtser et al. [26], who reported increased errors for the RealSense D435 AIRS camera with increased sunlight exposure. The ECDF plots shown in Figure 14 indicate that the errors for the RealSense D415 increased outdoors but were still similar to those of the Kinect Azure and Intel L515 (after correcting for their distance bias using ICP). However, the RealSense D415 also had a significant increase in missing scan points when operated in direct sunlight. This is illustrated in Table 3, where the percentage of missing scan points relative to the photogrammetry scan increased from about 1% indoors to 14% outdoors. Additionally, the 3D shape of individual grapes was less pronounced, which would make it harder to identify and measure the size of the grapes. This may be because the camera was not able to use its projected IR pattern due to saturation by sunlight. Saturation of the stereo IR cameras may also have occurred, and the camera may have struggled with the dynamic range caused by direct illumination from the sun combined with shadows.
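The distance-bias correction mentioned above can be illustrated with a short Open3D sketch: rigid point-to-point ICP is used to shift the RGB-D scan onto the photogrammetry reference before the errors are recomputed. The file names and correspondence distance are assumptions for illustration only, not the exact settings used in this work.

```python
import numpy as np
import open3d as o3d

# Hypothetical inputs: RGB-D scan and photogrammetry ground truth, roughly pre-aligned.
source = o3d.io.read_point_cloud("tof_bunch.ply")
target = o3d.io.read_point_cloud("photogrammetry_bunch.ply")

# Point-to-point ICP; a 20 mm correspondence distance is an illustrative value chosen
# to be larger than the ~8 mm distance bias reported for the ToF and LiDAR cameras.
result = o3d.pipelines.registration.registration_icp(
    source, target, max_correspondence_distance=0.02,
    init=np.eye(4),
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())

source_corrected = source.transform(result.transformation)
errors = np.asarray(source_corrected.compute_point_cloud_distance(target))
print(f"median error after ICP bias correction: {np.median(errors) * 1000:.2f} mm")
```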
The Kinect V1 SL camera also had low depth errors for measurements made indoors. However, Table 3 shows that about 20% of its scan points were missing, the highest of all the cameras. This resulted in a smooth scan of the grape bunch that did not display the valleys between grapes. This phenomenon can also be seen in the plots presented by Marinello et al. and Hacking et al. [21,22,23]. The Kinect V1 shows a significant deterioration in resolution as the distance of the grapes from the camera increases, as reported by Marinello et al. [21]. This appears to be related to this camera's strong dependence of depth quantisation on scan depth.
The Kinect V1 could not be used for scanning grapes outdoors in direct sunlight. This was expected since its projected IR pattern would have been saturated by the sunlight. Hacking et al. [22,23] also reported issues with its performance when used outdoors and therefore suggested that the Kinect V2 should be investigated for outdoor grape bunch scanning, since it would be more robust to sunlight.
The cameras that used ToF technologies were found to be more robust to sunlight conditions. Both the Kinect Azure and Intel L515 appeared to provide similar results indoors and outdoors in direct sunlight. The Kinect V2 had higher errors than the Azure and Intel L515; it was able to operate in sunlight but had some issues with saturation, resulting in missing scan points. This may be addressed by adjusting the exposure in software.
The ToF and LiDAR cameras produced scans of the grapes with a distance bias of about 8 mm and a distortion in the shape of the scanned grapes, neither of which was observed for the SL and AIRS cameras. The shape distortion for the ToF and LiDAR cameras makes individual grapes within the scan more prominent and easier to identify than with the Kinect V1 and the RealSense D415, and may therefore be beneficial for counting individual grapes. The plots in Figure 8, Figure 10 and Figure 12 show that these distortion effects were largely removed when the grapes were painted. This indicates that the distance bias and shape distortions are due to diffused scattering within the berries of the light transmitted by these cameras.
The Intel L515 LiDAR appeared to have slightly less distance bias and distortion than the two ToF cameras. The difference in distortion between the ToF and LiDAR cameras may be due to the way they emit light. ToF cameras emit a single wide-angle coded pulse and capture the returning light from a range of locations simultaneously as pixels. If this light pulse enters a grape and experiences diffused scattering, each pixel of the ToF camera corresponding to the grape will receive some combination of light entering across the entire surface of the grape visible to the camera. In contrast, LiDARs typically build up the point cloud in a scanning process, making measurements at a single scan point location at a time. This means that the light detected by the LiDAR may be more localised within the grape than for the ToF camera. Given the different methods used by the two types of cameras, it is understandable that each would have a different distortion pattern.
There have been a few reports of ToF cameras having a distance bias in fruit due to diffused scattering. Neupane et al. [37,38] reported that ToF cameras provided distance measurements for mangoes that were biased to be slightly too large due to diffused scattering within the fruit. This distance bias increased over several days and was suggested as a means of measuring the ripeness of the mango fruit. Sarkar et al. [39] used this phenomenon to investigate the ripeness of apples using a ToF camera and polarisers. However, we have not seen any previous report of a shape distortion in ToF camera scans of fruit. The fact that the shape distortion is so pronounced for grapes may be due to the comparatively smaller size of the berries and their higher translucency compared with the fruits investigated previously.
This raises the question: could the distortion of RGB-D cameras that use ToF technology be used to provide a non-destructive estimation of grape properties such as ripeness? Future work is planned to investigate how the distortion effects vary with berry ripeness and size. This might also give some insight into the potential of correcting the ToF and LiDAR scans for these distortions in post-processing.
The ability to identify individual grapes from 3D scans could be beneficial. It could potentially allow the number and size of berries in bunches to be measured, and it might allow more accurate yield estimation through 3D bunch architecture modelling. There have been several works that used RANSAC to detect and size grapes. However, these works used high-resolution 3D scans captured using commercial laser and structured light scanners [5,7,8,9,10] and photogrammetry [13], not depth cameras. Yin et al. [29] used RANSAC to fit cylinders to ZED RGB-D camera scans of grape bunches; however, this was for pose estimation of the entire grape bunch for robotic harvesting applications and did not attempt to fit individual grapes.
The RANSAC algorithm was used in this work on both the photogrammetry and RGB-D camera scans and showed some promise for detecting individual grapes in the RGB-D camera scans. All of the RGB-D cameras gave similar median 2D positions for the spheres/grapes relative to photogrammetry, as indicated in Table 4. However, the RANSAC algorithm produced fitted spheres with a smaller radius for the ToF and LiDAR cameras, which was to be expected given the shape distortion observed for these cameras.
The ability of RANSAC to correctly segment out individual berries was lower for the RGB-D cameras than for the photogrammetry scans. As an example, in Figure 15 it can be seen that the Kinect V1 shows multiple grapes close to each other that have the same colour, indicating that the algorithm has failed to separate these particular berries into separate spheres. In contrast, a much higher proportion of the berries are correctly segmented in the photogrammetry scan.
The RANSAC algorithm also identified more grapes in the photogrammetry scans than in the RGB-D camera scans. This was particularly pronounced for the grapes located around the edges of the bunch. However, this appears to be mainly related to the way the photogrammetry scans are obtained, using images captured from a range of positions relative to the grape bunch, whereas the RGB-D camera scans shown here were captured from a single location. This means the RGB-D cameras see a lower proportion of the surface area of the grape bunch. Improved results could be obtained by merging multiple RGB-D camera scans taken at a range of positions and angles relative to the grapes, which could be achieved using SLAM or a similar point cloud alignment technique [14]. This should make the RGB-D camera scans more comparable to the photogrammetry scans.
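A minimal sketch of this merging idea is shown below, assuming several roughly pre-aligned views (e.g., from known platform poses) and using pairwise ICP in Open3D rather than a full SLAM pipeline; the file names and parameter values are hypothetical.

```python
import numpy as np
import open3d as o3d

# Hypothetical scans of the same bunch taken from different viewpoints,
# roughly pre-aligned (e.g., from known camera poses on a platform).
files = ["bunch_view_0.ply", "bunch_view_1.ply", "bunch_view_2.ply"]
clouds = [o3d.io.read_point_cloud(f) for f in files]

merged = clouds[0]
for cloud in clouds[1:]:
    # Refine the alignment of each additional view against the growing model.
    reg = o3d.pipelines.registration.registration_icp(
        cloud, merged, max_correspondence_distance=0.01, init=np.eye(4),
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    merged += cloud.transform(reg.transformation)

# Downsample to remove duplicate points where the views overlap.
merged = merged.voxel_down_sample(voxel_size=0.001)
o3d.io.write_point_cloud("bunch_merged.ply", merged)
```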
Future Work
More investigation is needed to ascertain the optimal method of detecting and sizing grapes from RGB-D camera scans. Future work could look at fitting other shapes to the grape scans, such as ellipsoids or a shape that mimics the distortions caused by diffused scattering for the ToF and LiDAR cameras. Additionally, custom-designed algorithms may be needed for these cameras, possibly including correction of the distortion effects.
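As one possible direction, a simple axis-aligned ellipsoid can be fitted to a segmented berry by linear least squares, as sketched below. This is an illustrative alternative to the sphere fit used in this work, not a method from this study, and a rotated ellipsoid would require fitting the full quadric with cross terms.

```python
import numpy as np

def fit_axis_aligned_ellipsoid(points):
    """Least-squares fit of A x^2 + B y^2 + C z^2 + D x + E y + F z = 1.

    Returns (centre, semi_axes). Cross terms are omitted, so the ellipsoid is
    assumed to be axis-aligned; degenerate fits (non-positive A, B, or C) should
    be rejected by the caller.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    M = np.c_[x**2, y**2, z**2, x, y, z]
    coeffs, *_ = np.linalg.lstsq(M, np.ones(len(points)), rcond=None)
    A, B, C, D, E, F = coeffs
    centre = np.array([-D / (2 * A), -E / (2 * B), -F / (2 * C)])
    s = 1.0 + D**2 / (4 * A) + E**2 / (4 * B) + F**2 / (4 * C)
    semi_axes = np.sqrt(np.array([s / A, s / B, s / C]))
    return centre, semi_axes
```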
The ToF and LiDAR cameras had slightly higher errors than the other two cameras indoors, even when the grapes were painted or when the distance bias had been removed in post-processing. It is possible that these errors could be reduced if additional filtering of the flying pixels were performed. However, this could potentially also remove real scan points, particularly in the valleys between individual grapes. It is also possible that the error analysis process used here slightly overestimates the errors for these cameras.
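One simple form of such flying-pixel filtering is to discard depth pixels that deviate strongly from their local neighbourhood, as sketched below. The window size and 10 mm threshold are illustrative assumptions, and, as noted above, an aggressive threshold risks also removing genuine points in the valleys between berries.

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_flying_pixels(depth_m, window=5, max_jump_m=0.010):
    """Mask depth pixels that deviate strongly from their local neighbourhood.

    Flying pixels typically appear along depth discontinuities (e.g., berry edges),
    where ToF/LiDAR pixels mix foreground and background returns.
    """
    depth_m = np.asarray(depth_m, dtype=float)
    local_median = median_filter(depth_m, size=window)
    flying = np.abs(depth_m - local_median) > max_jump_m
    cleaned = depth_m.copy()
    cleaned[flying] = np.nan        # mark filtered pixels as invalid
    return cleaned, flying
```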
Improvements could also be made to the error analysis technique used in this work. The errors in the RGB-D camera scans were obtained by comparing their depth scans with those obtained using photogrammetry, and there could be some small errors in these photogrammetry scans. It appears that these scans had some smoothing in the valleys between grapes, in a similar manner to the RealSense D415. It would be interesting in future work to use an alternative scanning system, such as a commercial laser scanner, for obtaining the ground truth scans.
The method used to calculate the distance errors could be improved in future work, particularly for the scans where a distance bias is present. One option could be to project a line from the location of the RGB-D camera to a scan point in its depth scan. One could then calculate the point on the line which is closest to a scan point on the photogrammetry scan (or where it passes through a mesh surface obtained from the photogrammetry scan). The distance along the line from that point to the RGB-D scan point could then be used as the depth error.
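A sketch of this proposed ray-based error calculation is given below, using a k-d tree to restrict the search for the photogrammetry point closest to each camera ray; the point-cloud (rather than mesh-intersection) variant is shown, and the search radius is an illustrative assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def ray_depth_errors(camera_pos, rgbd_points, photo_points, search_radius=0.02):
    """Signed depth error along the camera ray for each RGB-D scan point.

    Inputs are numpy arrays: camera_pos (3,), rgbd_points (N, 3), photo_points (M, 3).
    For every RGB-D point, the photogrammetry point closest (perpendicularly) to the
    camera->point ray is found among neighbours within search_radius, and the error is
    the signed distance between the two points measured along the ray.
    """
    tree = cKDTree(photo_points)
    errors = np.full(len(rgbd_points), np.nan)
    for i, p in enumerate(rgbd_points):
        ray = p - camera_pos
        t_p = np.linalg.norm(ray)
        ray /= t_p                                        # unit ray direction
        idx = tree.query_ball_point(p, search_radius)
        if not idx:
            continue                                      # no ground truth nearby
        cand = photo_points[idx] - camera_pos
        t = cand @ ray                                    # distance along the ray
        perp = np.linalg.norm(cand - np.outer(t, ray), axis=1)
        j = np.argmin(perp)                               # candidate closest to the ray
        errors[i] = t_p - t[j]                            # + means RGB-D point too far
    return errors
```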
This work was performed with green grapes. Some preliminary testing with red grapes indicated that these also had a shape distortion and distance bias that appeared similar to those observed in the green grapes. However, this was not investigated in detail, and more work is needed with other types of grapes.
The measurements described in this work were performed in controlled lab-type environments. This was appropriate for the type of investigations performed in this study. However, it should be noted that achieving a fully automated system in a real vineyard environment would be more challenging. For example, this would require segmentation to automatically distinguish grapes from leaves and stems [27]. There may also be occlusions by leaves or other grape bunches. More work is needed to address these types of challenges.
5. Conclusions
The Kinect V1 is no longer in production and hence is unlikely to be used in the future for grape yield estimation. However, it provides a comparison of IR structured light technology with the technologies used by the other RGB-D cameras. The Kinect V1 was not able to function in direct sunlight, likely because its projected IR pattern was saturated by sunlight. This indicates that RGB-D cameras operating with IR structured light would only be suitable for measurements made at night or with a cover system that blocks out sunlight.
The Kinect V1 provided scans made indoors (out of direct sunlight) with relatively low errors for the parts of the grapes facing the camera. However, it did not capture portions of the grapes, particularly in the valleys between individual grapes. While this might be adequate for rough volume estimations using a convex hull or mesh of the grape bunch scan, it makes identifying and sizing individual grapes within the scan difficult. This is illustrated in the RANSAC results, where the segmentation process struggled to correctly separate out many neighbouring grapes. In addition, the depth scans for the Kinect V1 had relatively high quantisation compared with the other cameras.
The RealSense D415, which uses active stereoscopy, provided the lowest errors of the cameras analysed. Its indoor scans did not have the missing scan points or quantisation seen in the Kinect V1. However, it smoothed out the valleys between the grapes, making it harder to detect individual grapes from the depth scans. The scans made with this camera in direct sunlight had slightly higher errors and missing scan points. In future work, we would look at adjusting the exposure of this camera in software to see if this issue can be addressed. However, it appears that sunlight was saturating its projected IR pattern, meaning it was acting purely as a passive stereo camera. This suggests that cameras operating using AIRS technology may not have any additional benefit for yield estimation in sunlight compared with RGB-D cameras that operate using passive stereo alone. This may be investigated in future work.
The ToF (Kinect V2 and Kinect Azure) and LiDAR (Intel L515) cameras provided the best ability to detect individual grapes of all the cameras tested. However, they produced 3D scans of the grapes that were biased to give depth distances that were too large. Additionally, these cameras produced distortions in the scans in the form of peaks centred on each grape location.
The distance bias and shape distortion were removed when the grapes were painted, indicating that they were the result of diffused scattering within the grape. Previous work, such as Neupane et al. [37], reported measuring a distance bias for fruit using ToF cameras and related this to the ripeness of the fruit. However, we are not aware of any previous studies that have reported a distortion in the shape of the scans of the fruit. This distortion may be enhanced by the small size of grape berries and their translucent properties.
The distance bias found in the LiDAR and ToF camera scans of the grapes may not be an issue if one is only interested in the shape of the grape bunch. In fact, the distortion pattern makes it easier to identify individual grapes than with the SL or AIRS cameras. However, more work is needed to investigate how much this distance bias and distortion affect the accuracy of grape volume/yield estimations; in our study, they did result in smaller berry diameters detected using RANSAC compared with the other cameras. More work is also needed to understand what role factors such as ripeness, berry size, and variety play in the magnitude of the distance bias and the shape of the distortion. With more understanding of these factors, it may be possible to use these distortions to perform non-destructive measurement of grape properties such as ripeness, or possibly to correct for the distortions in post-processing.
In future work, we plan to investigate further the potential of the ToF and LiDAR cameras since they were less affected by sunlight and there is potential to utilise the distortion present in their scans for more accurately identifying individual berries. Additionally, there may be opportunities for using the distortion for non-destructive testing of berry properties.