1. Introduction
Strawberries are a high-value crop in the economy of Florida. According to a U.S. Department of Agriculture report, the production value of strawberries in Florida was $282 million in 2018, the second largest in the United States [1]. The strawberry harvest season runs from December to April and, during this time, flowers form and become fruit over the subsequent weeks. There may be major “fruit waves”, in which the production yield varies greatly from week to week [2]. Weather fluctuations are one of the main causes of this phenomenon. In the main production areas in central Florida, the mean daily temperature is 25 °C in early November, declining to 15 °C in the middle of January and rising to 21 °C in late April [3]. These day lengths and temperatures are conducive to flower bud initiation [4,5]. In Florida, the fruit development period typically extends from three to six weeks as the day length declines and temperatures fall below average [6,7].
Due to the dramatic fluctuation in weekly yields, strawberry growers need to monitor their fields frequently in order to schedule labor, equipment, and other resources for harvesting, transportation, and marketing [8,9,10]. Predicting yields quickly and precisely is vital for optimal management, particularly for preventing labor shortfalls; accurate prediction can reduce the waste of unpicked, over-ripe fruit caused by labor shortages at the peak of the harvest [11]. At present, yield estimation is done manually, which is very time-consuming and labor-intensive. In the early 2000s, researchers found relationships between environmental factors and strawberry yields and tried to build models for yield prediction. Døving and Mage [12,13] found that climate conditions had a greater impact on yield during the flower induction and flower differentiation periods than during the flowering and harvesting periods. Using historical yield data from 1967–2000 in Norway, they found a strong correlation between strawberry yield level and the fungicides used against Botrytis cinerea; the correlation between yield levels and temperatures varied across seasons [14]. Misaghi et al. [15] fed vegetation indices, soil characteristics, and related plant parameters into an artificial neural network and developed a model to predict strawberry yields. MacKenzie and Chandler [16] built an equation to predict weekly yield from the number of flowers and temperature data. The coefficient of determination (r²) between actual and estimated yields was 0.89, but the number of flowers was still counted manually. Using near-ground images and machine vision to detect strawberries, Kerfs et al. [10] developed a fruit detection model which achieved a mean average precision (mAP) of 0.79. However, the images were captured by a hand-held camera, so acquiring data for a whole large field took a long time. Thus, an automated, efficient, and precise way to count flowers and strawberries is urgently needed for regular yield prediction.
In recent years, unmanned aerial vehicles (UAVs) have been widely used in agricultural remote sensing [17]. Candiago et al. [18] mounted a multi-spectral camera onto a multi-rotor hexacopter to build orthoimages of crops with vegetation indices. Garcia-Ruiz et al. [19] compared UAV-based and aircraft-based imaging sensors for the detection of Huanglongbing (HLB) disease and found that the accuracy was higher for the UAV-based data, as the UAV could fly closer to the citrus trees. Baluja et al. [20] assessed the variability of vineyard water status using both thermal and multi-spectral cameras on a UAV platform. Zarco-Tejada et al. [21] used high-resolution hyperspectral imagery acquired from a UAV for leaf carotenoid content estimation. Several reasons explain the popularity of UAVs: (1) a drone’s flying height can be controlled within 0.5–500 m, so it can get closer to the ground and obtain higher-resolution images [22]; (2) UAVs have strong environmental adaptability and low requirements for weather conditions [23], capturing high-quality images even on cloudy or rainy days [24]; (3) they need only a small amount of space to take off (multi-rotor UAVs and helicopters take off and land vertically, while fixed-wing ones can take off via ejection and land via parachute [25]), so no airport or launch center is required; and (4) drones are becoming cheaper and easier to carry, and their modular designs make them easy to modify for various tasks in different situations [26].
Object detection is a computer vision task that deals with segmenting images and recognizing targets, and it has been extended to applications in agriculture. Behmann et al. [27] introduced machine learning methods, such as support vector machines and neural networks, for the early detection of plant diseases based on spectral features and for weed detection based on shape descriptors. Choi et al. [28] enhanced image illumination based on contrast-limited adaptive histogram equalization (CLAHE), which helped them to detect dropped citrus fruit on the ground and evaluate decay stages; the highest detection accuracy was 89.5%, but outdoor illumination conditions presented a significant challenge. Deep learning has recently entered the domain of agriculture for image processing and data analysis [29]. Dyrmann et al. used convolutional neural networks (CNNs) to recognize 22 crop and weed species, achieving a classification accuracy of 86.2% [30]. CNN-based systems have also been increasingly used for obstacle detection, which helps robots or vehicles to locate and track their position and work autonomously in a field [31]. The region-based convolutional neural network (R-CNN) framework [32] combines region proposals, such as those from the selective search (SS) [33] and edge boxes [34] methods, with CNNs, improving mAP to 53.7% on PASCAL VOC 2010. Christiansen [35] used an R-CNN to detect obstacles in agricultural fields and showed that the R-CNN was suitable for a real-time system, due to its high accuracy and low computation time. Recent work in deep neural networks has led to a state-of-the-art object detector, termed Faster Region-based CNN (Faster R-CNN) [36], which has been compared to the R-CNN and Fast R-CNN methods [37]. It uses a region proposal network (a fully convolutional network paired with a classification deep convolutional network), instead of SS, to locate region proposals, which improves training and testing speed while also increasing detection accuracy. Bargoti and Underwood [38] adapted this model for outdoor fruit detection, which could support yield map creation and robotic harvesting tasks; its precision and recall varied from 0.825 to 0.933, depending on the circumstances and applications. A ground-robot system using the Faster R-CNN method to count plant stalks yielded a coefficient of determination of 0.88 between the deep learning detection results and manual counts [39]. Sa et al. [40] explored a multi-modal fusion method to combine RGB and near-infrared (NIR) image information and used a Faster R-CNN model pre-trained on ImageNet to detect seven kinds of fruits, including sweet pepper, rock melon, apple, avocado, mango, and orange.
The objective of this study was to develop an automatic near-ground strawberry flower detection system based on the Faster R-CNN detection method and a UAV platform. This system was able to detect and locate flowers and strawberries in the field, as well as count their numbers. With its help, farmers could build flower, immature fruit, and mature fruit maps to predict yields quickly, precisely, and periodically.
4. Discussion
In order to quickly count and locate flowers and fruit in a strawberry field with the help of a normal consumer drone, we stitched the images captured by the drone together and transformed them into an orthoimage. An orthoimage is a raster image that has been geometrically corrected for topographic relief, lens distortion, and camera tilt; it accurately represents the Earth’s surface and can be used to measure true distances [18,54]. The quality of orthoimages mainly depends on the quality and overlaps of the aerial images. The frontal overlap is usually 70–80% and the side overlap is usually no less than 60%. For the same overlap conditions, the closer the aircraft is to the ground, the finer the ground sample distance (GSD) of the images will be, which helps the detection system to perform better. However, it also takes more time and consumes more battery power to take images at a lower altitude, which reduces efficiency and drone endurance. The specific working altitude should be adjusted according to the environmental conditions and task requirements. Many studies [25,44,55,56] set the frontal overlap to around 80% and the side overlap to around 70%; most of their drones flew above 50 m in height and had relatively coarse GSDs. In our experiments, the flight images were taken near the ground, so we set the frontal overlap to 70% and the side overlap to 60%, in order to increase flight efficiency while still meeting the orthoimage-building requirements.
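As a rough illustration of these trade-offs, the sketch below computes the GSD and exposure spacing for a nadir-pointing camera. The sensor and lens parameters are illustrative placeholders, not the specifications of the camera used in this study, so the printed values will not match the GSDs reported here.

```python
# Minimal sketch: GSD and exposure spacing for a nadir-pointing camera.
# Sensor/lens parameters below are illustrative placeholders, not the
# camera used in this study.

def gsd_mm(altitude_m, focal_mm, sensor_w_mm, image_w_px):
    """GSD in mm/pixel = pixel pitch * (altitude / focal length)."""
    return (sensor_w_mm / image_w_px) * (altitude_m * 1000.0 / focal_mm)

def exposure_spacing_m(altitude_m, focal_mm, sensor_w_mm, sensor_h_mm,
                       frontal_overlap, side_overlap):
    """Along-track distance between shots and across-track distance between lines."""
    footprint_w_m = sensor_w_mm * altitude_m / focal_mm  # across-track footprint
    footprint_h_m = sensor_h_mm * altitude_m / focal_mm  # along-track footprint
    return (footprint_h_m * (1.0 - frontal_overlap),
            footprint_w_m * (1.0 - side_overlap))

for alt in (2.0, 3.0):  # the two flight heights used in this study
    print(f"{alt} m: GSD = {gsd_mm(alt, 8.8, 13.2, 5472):.2f} mm/px, "
          f"spacing = {exposure_spacing_m(alt, 8.8, 13.2, 8.8, 0.70, 0.60)}")
```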
Some distortions occurred at the edges of the orthoimages, due to the lack of image overlap there; more flight routes will be used to cover the field edges in our next experiment. There were also some blurred or distorted parts in the plant areas of the orthoimages, caused by the strong downwash produced by the drone as it flew across the field. These were more common in the 2 m orthoimages, as the drone flew closer to the ground. Most of the blurred or distorted parts occurred in the leaf areas, which were more susceptible to wind than the flowers and fruit; the flowers and fruit were barely affected. As a bonus, the downwash could actually help the camera to capture more flowers and fruit hidden under the leaves, so more flowers and fruit were detected in the 2 m orthoimages.
Object detection is the task of finding different objects in an image and classifying them. R-CNN [32] was the first region-based object detection method. It selects multiple high-quality proposed regions using the selective search [33] method and labels the category and ground-truth bounding box of each proposed region. Then, a pre-trained CNN transforms each proposed region into the input dimensions required by the network and uses forward computation to output a feature vector for each proposed region. Finally, the feature vectors are sent to linear support vector machines (SVMs) for object classification and then to a regressor to adjust the detection position. Fast R-CNN [37] inherits the framework of R-CNN but performs CNN forward computation on the image as a whole and uses a region-of-interest pooling layer to obtain fixed-size feature maps. Faster R-CNN replaces the selective search method with a region proposal network, which reduces the number of proposed regions generated while ensuring precise object detection. We compared the performances of R-CNN, Fast R-CNN, and Faster R-CNN on our dataset; the results showed that Faster R-CNN had the lowest training time, the highest mAP score, and the fastest detection rate. So far, Faster R-CNN is the best region-based object detection method for identifying different objects and their boundaries in images.
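For readers who wish to reproduce this kind of pipeline, the following is a minimal sketch of fine-tuning Faster R-CNN using torchvision's off-the-shelf implementation (its COCO-pretrained ResNet-50 + FPN model stands in for the ImageNet-initialized network described here). The class count and the dummy training batch are placeholders, not our actual dataset.

```python
# Minimal sketch: fine-tuning torchvision's Faster R-CNN (ResNet-50 + FPN)
# for flower/fruit detection. The class list and the dummy batch are
# placeholders, not the dataset used in this study.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 4  # background + flower + immature fruit + mature fruit

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
# Replace the COCO classification head with one sized for our classes.
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One illustrative training step on a dummy image/annotation pair.
images = [torch.rand(3, 500, 500)]
targets = [{"boxes": torch.tensor([[60.0, 60.0, 140.0, 140.0]]),
            "labels": torch.tensor([1])}]

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
loss_dict = model(images, targets)   # RPN + detection head losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```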
In our detection system, we fine-tuned a Faster R-CNN detection network, initialized from a model pre-trained on ImageNet, which gave state-of-the-art performance on the split orthoimage data. The average precisions varied from 0.76 to 0.91 for the 2 m images and from 0.61 to 0.83 for the 3 m images. Detection of flowers and mature fruit worked well, but immature fruit detection did not meet our expectations. The shapes and colors of immature fruit were sometimes very similar to those of dead leaves, which was the main reason for the poor results; more images are needed for future network training. Additionally, there were always some occlusion problems, where flowers and fruit hidden under the leaves could not be captured by the camera. This occlusion varied slightly across the growth stages of the strawberries: as more flowers turned to fruit, the leaves tended to grow larger in order to deliver more nutrients to the fruit. The occlusion was around 11.5% and 15.2% in our field in November and December 2018, respectively. Further field experiments are needed to quantify the occlusion in different seasons, so that we can establish an offset factor to reduce the counting errors of the deep learning detector.
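One simple form such an offset factor could take is sketched below; the correction formula is our illustrative assumption, while the occlusion rates are the field estimates reported above.

```python
# Minimal sketch: scaling detector counts by a seasonal occlusion offset.
# The correction formula is an illustrative assumption; the occlusion rates
# are the field estimates reported in the text.
def corrected_count(detected: int, occlusion_rate: float) -> float:
    """Estimate the true count, assuming a known fraction is hidden by leaves."""
    return detected / (1.0 - occlusion_rate)

print(corrected_count(200, 0.115))  # November 2018, ~11.5% occlusion -> ~226.0
print(corrected_count(200, 0.152))  # December 2018, ~15.2% occlusion -> ~235.8
```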
We chose inverse distance weighting (IDW) for the interpolation of the distribution maps. IDW estimates cell values by averaging the values of the sample data points in the neighborhood of each processing cell: the closer a point is to the center of the cell being estimated, the more influence (or weight) it has in the averaging process. Kriging is a more advanced geostatistical procedure that generates an estimated surface from a scattered set of points with z-values; however, it requires many more data points, and a thorough investigation of the spatial behavior of the phenomenon represented by the z-values should be done before selecting it as the interpolation method. In many studies, Kriging has been reported to perform better than IDW, but this is highly dependent on the variability of the data, the distances between the data points, and the number of data points available in the study area. We will try both methods with more data in the future; better results may be obtained by comparing multiple interpolation results against actual counts in the field and in the acquired images.
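To make the weighting scheme concrete, here is a minimal IDW sketch in plain numpy; the power parameter p = 2 and the sample counts are illustrative choices, not values from our maps.

```python
# Minimal sketch: inverse distance weighting on arbitrary query points.
# p = 2 is a common default power; the sample counts are hypothetical.
import numpy as np

def idw(xy_known, z_known, xy_query, p=2, eps=1e-12):
    """Estimate z at each query point as a distance-weighted mean of samples."""
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    w = 1.0 / (d ** p + eps)            # nearer samples receive larger weights
    return (w @ z_known) / w.sum(axis=1)

# Hypothetical flower counts at four plot centres (x, y in metres).
xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
z = np.array([12.0, 30.0, 18.0, 25.0])
print(idw(xy, z, np.array([[5.0, 5.0], [2.0, 8.0]])))
```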
5. Conclusions
In this paper, we presented a deep learning strawberry flower and fruit detection system based on high-resolution orthoimages reconstructed from drone images. The system could be used to build yield estimation maps, helping farmers predict the weekly yields of strawberries and monitor the output of each area, thereby saving time and labor costs.
In developing this system, we used a small UAV to take near-ground RGB images for building orthoimages at 2 m and 3 m heights, where the GSD was 1.6 mm and 2.4 mm, respectively. After their generation, we split the original orthoimages into sequential pieces for Faster R-CNN detection, based on the ResNet-50 architecture and transfer learning from ImageNet, to detect 10 objects. The results were presented both quantitatively and qualitatively. The best detection performance was for mature fruit of the Sensation variety at 2 m, with an AP of 0.91; immature fruit of the Radiance variety at 3 m was the most difficult to detect (the model tended to confuse them with green leaves), with the lowest AP of 0.61. We also compared the flower counts from the deep learning model with manual counts and found the average deep learning counting accuracy to be 84.1%, with an average occlusion of 13.5%. Thus, this method proved effective for counting flowers.
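As an illustration of the splitting step, the sketch below tiles a large orthoimage into overlapping crops sized for the detector; the tile size, overlap, and file path are illustrative choices, not our exact preprocessing parameters.

```python
# Minimal sketch: tiling an orthoimage into overlapping crops for detection.
# Tile size, overlap, and the file path are illustrative placeholders.
import os
from PIL import Image

def split_orthoimage(path, tile=1000, overlap=100):
    """Yield (x, y, crop) tiles that cover the whole orthoimage."""
    img = Image.open(path)
    w, h = img.size
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            yield x, y, img.crop((x, y, min(x + tile, w), min(y + tile, h)))

os.makedirs("tiles", exist_ok=True)
for x, y, crop in split_orthoimage("orthoimage.tif"):  # hypothetical path
    crop.save(f"tiles/tile_{x}_{y}.png")
```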
We also built distribution maps of flowers and immature fruit on 13 April, and of immature and mature fruit on 27 April, based on the numbers and distributions calculated by Faster R-CNN. The results showed that the mature fruit map of 27 April corresponded closely to the flower and immature fruit maps of 13 April. The flower distribution map of 13 April and the immature fruit map of 27 April also showed a strong relationship, which demonstrated that this system could help farmers to monitor the growth of strawberry plants.