1. Introduction
In recent years, image recognition techniques based on deep learning have been widely used in agriculture [1,2,3]. With PyTorch [4], TensorFlow [5], and other supporting frameworks in widespread use, deep learning can learn efficiently from high-quality images and shows leading performance in agricultural image recognition tasks [6,7], particularly in the application of convolutional neural networks (CNNs) to agricultural computer vision. For example, Jiang et al. [8] proposed a tomato recognition and localisation algorithm based on an improved YOLOv5 model, which significantly improved the detection accuracy and real-time performance of tomato recognition. Li et al. [9] used a multi-scale Faster R-CNN model combined with RGB-D images to achieve efficient detection and counting of passion fruit. Milioto et al. [10] constructed a CNN-based model for identifying field weeds that can effectively distinguish between crops and weeds under different lighting conditions. However, deep learning models require the support of large-scale datasets, which are difficult to collect in agriculture because of the constraints of crop growth. Image augmentation techniques are therefore particularly important in agricultural image recognition tasks [11,12,13].
Image augmentation techniques fall into two main categories: traditional image augmentation and deep learning-based image augmentation [14]. Traditional methods mainly rely on geometric transformations and colour processing, applying operations such as rotation, scaling, flipping, translation, cropping, and brightness adjustment. For example, Shorten et al. [15] randomly rotated images to increase the diversity of the training data and improve the robustness of disease recognition in a crop disease detection task, and Krizhevsky et al. [16] used horizontal flipping to augment the training data in the classical AlexNet. Deep learning-based image augmentation methods mainly include generative adversarial networks (GANs) and variational autoencoders (VAEs). Among them, the GAN is a widely used deep learning model in the field of image augmentation. It was proposed to generate new data with features similar to the training data, and over the past few years GAN-based image augmentation methods have been widely explored and applied in agriculture. Zhang et al. [17] reconstructed the geometry of kiwifruit on a synthetic dataset using a conditional generative adversarial network (CGAN), which provided valuable data for fruit detection and classification. Wang et al. [18] focused on the classification and detection of wheat by generating high-resolution images with richer detail through GANs.
As one of the most important agricultural products, tomato is produced and consumed in huge quantities worldwide [19]. Studying the growth of tomato plants during the reproductive period is important for improving yield and quality [20,21,22,23]. In particular, studying the flowering and fruit-setting stages and the colour-change and ripening stages of tomato not only helps to understand the growth mechanism of the crop, but also helps farmers and researchers to take timely measures to prevent pests and diseases, optimise planting strategies, and ultimately increase production and income [24]. However, significant differences in the duration of the different growth stages of tomato plants result in an uneven number of images collected during monitoring. Therefore, the objective of this study is to develop an image augmentation method for high-density tomato cultivation in facilities, in order to improve the quality of image datasets, overcome the uneven distribution of data caused by the differing durations of crop growth stages, and provide methodological support for improving the accuracy of tomato growth monitoring. The method first performs target detection on tomato flowers to obtain geometric position information, then applies different segmentation and screening strategies based on the coordinate information, and finally randomly recombines the processed regional images. In addition, considering the cost-effectiveness of agricultural production applications, low-cost visible light imaging is used, and two acquisition modes, in situ visible light cameras and handheld mobile phone photography, are considered to analyse the applicability of different data acquisition methods.
2. Materials and Methods
2.1. Data Acquisition and Pre-Processing
All images used in this study were collected from the solar greenhouse of the Beijing Academy of Agriculture and Forestry Sciences. The tomato variety was “Jingcai 8”, grown in coconut coir substrate. Images were collected from September 2023 to January 2024 using both visible light cameras and mobile phones. The visible light camera (model H982, resolution 2592 × 1944 pixels) was fixed on a wire frame 30–40 cm from the flowers and 50 cm above the ground, facing the tomato flowers, and captured images on an hourly schedule. The mobile phone was an iPhone 13 with a resolution of 4032 × 3024 pixels, held by hand to photograph the tomato flowers.
Figure 1 shows an example of the image dataset under different shooting methods, containing different environmental conditions such as dim, bright, and occluded.
For this paper, the collected images were pre-processed as follows. From the images captured by the visible light camera, 571 images containing flowers were selected as training images and 100 as validation images; from the images captured by the mobile phone, 504 training images and 100 validation images were selected. There were no duplicate images between the training and validation sets, preventing the model from overfitting. All images were uniformly resized to 1024 × 1024 pixels, and the annotation software LabelImg (version 1.8.6) was used to draw the bounding rectangle of each tomato flower target in all training images, completing the manual annotation; the detection target was assigned the single class “flower”, and the annotations were converted into YOLO format. Finally, the dataset was divided into three categories, as shown in part A of Figure 2. The first category is the image dataset captured by the visible light camera (Camera), the second is the image dataset captured by the mobile phone (Phone), and the third is the dataset formed by mixing the first two (Mix). Part B of Figure 2 explains the specific flow of the augmentation method based on the geometric position of the salient features in an image, including target information segmentation, image region screening, and random recombination of images; the details are presented in Section 2.2.
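As a worked illustration of the resizing and annotation conversion described above, the following minimal sketch (not the exact script used in this study) resizes an image to 1024 × 1024 pixels and converts a corner-style LabelImg box into a normalised YOLO label line; the file name and the class index 0 for “flower” are assumptions for illustration.

```python
import cv2  # OpenCV, part of the software stack used in this study

TARGET = 1024  # all images in this study were resized to 1024 x 1024 pixels

def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h, cls=0):
    """Convert corner-style pixel coordinates (as drawn in LabelImg)
    into a normalised YOLO label line: 'cls cx cy w h'."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Hypothetical usage: resize one image and write its YOLO label.
img = cv2.imread("flower_0001.jpg")          # hypothetical file name
h0, w0 = img.shape[:2]
img = cv2.resize(img, (TARGET, TARGET))
# Because YOLO coordinates are relative, normalising against the original
# size gives the same label as normalising after the resize.
print(voc_box_to_yolo(812, 640, 980, 802, w0, h0))
```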
2.2. Balanced Amplification of Salient Feature Information in Images Based on Geometric Location
2.2.1. Setting Balanced Amplification Targets
In the tomato images collected in this study, about 14 days elapsed from flowering to the complete wilting of the flowers, whereas about 40 days elapsed from the beginning of fruit set to fruit ripening. As a result, the ratio of the number of images containing flowers to the number containing fruits was 1:4, i.e., far fewer flower images than fruit images. Training a model on classes with such an uneven amount of image data biases the model towards the classes with more data. Therefore, this paper adopts balanced image augmentation, proportionally augmenting the flower images to balance the data distribution. Each flower image is augmented three times so that, together with the original, its count reaches a level comparable to the number of fruit images, achieving class balance in the dataset and ensuring stable model training.
2.2.2. Equalisation Process of Salient Feature Information Based on Geometric Location
During image acquisition, the small proportion of tomato flowers in each image and the fact that most of the image area consists of the plant and greenhouse background lead to an imbalance between target information and background information. In addition, the large amount of irrelevant information strongly interferes with the training of the target detection model. Based on this, this paper proposes a salient feature information equalisation method based on geometric position, as shown in Figure 3. The method is divided into two parts: 1. target information segmentation based on visual detection; and 2. salient feature image screening based on geometric position.
The first part is target information segmentation based on visual detection. Traditional segmentation methods for flowers are mainly based on the flower pixels in the image. However, with such methods it is difficult to extract the flower pixels accurately from images that contain a large amount of irrelevant information. In addition, a flower consists of multiple parts, including the green receptacle, so the information of the whole flower cannot be fully extracted from the flower pixels alone. This paper therefore proposes a target information segmentation method based on visual detection, which treats the pixel area of the detection box returned by the target detector as the pixel area of the flower and adopts different segmentation methods according to this area. First, target detection is performed on the tomato flowers in the image, and the coordinates of the detection box are output: the upper-left corner $(x_1, y_1)$ and the lower-right corner $(x_2, y_2)$. The length of the detection box is $x_2 - x_1$ and its width is $y_2 - y_1$; multiplying the two gives the pixel area $S$ of the detection box, as shown in Equation (1):

$$S = (x_2 - x_1)(y_2 - y_1) \quad (1)$$

The sum $\sum_{i=1}^{N} S_i$ is the total pixel area of the detection boxes in all images. Based on Equation (2), the average pixel area $\bar{S}$ of the detection boxes is calculated, where $N$ is the total number of detection boxes and $S_i$ is the area of the $i$-th detection box:

$$\bar{S} = \frac{1}{N}\sum_{i=1}^{N} S_i \quad (2)$$
Different segmentation methods are applied according to the relationship between $S$ and $\bar{S}$. When the area of the detection box in an image is larger than $\bar{S}$, the flower occupies a relatively large area and the 2 × 2 image segmentation method is adopted. When the area of the detection box is smaller than $\bar{S}$, the flower occupies a relatively small area and the 3 × 3 image segmentation method is adopted. These two methods achieve an appropriate segmentation of the target information in the different cases.
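To make Equations (1) and (2) and the grid-selection rule concrete, a minimal Python sketch is given below; detection boxes are assumed to be (x1, y1, x2, y2) pixel tuples and images NumPy-style arrays, and the function names are illustrative rather than the authors' released code.

```python
def box_area(box):
    """Equation (1): S = (x2 - x1) * (y2 - y1) for one detection box."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def mean_box_area(all_boxes):
    """Equation (2): average pixel area of all N detection boxes."""
    return sum(box_area(b) for b in all_boxes) / len(all_boxes)

def split_image(image, box, mean_area):
    """Use a 2 x 2 grid when the flower box is larger than the average
    area, otherwise a 3 x 3 grid, and return the list of tiles."""
    n = 2 if box_area(box) > mean_area else 3
    h, w = image.shape[:2]
    th, tw = h // n, w // n   # 512 or 341 pixels for a 1024 x 1024 input
    tiles = []
    for r in range(n):
        for c in range(n):
            tiles.append(image[r * th:(r + 1) * th, c * tw:(c + 1) * tw])
    return tiles, n
```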
The second part is geometric position-based salient feature image screening, as shown in Figure 4. Its main purpose is to remove image regions that contain little information about the target. All images are 1024 × 1024 pixels. For 2 × 2 segmentation, the image is divided into four regions of 512 × 512 pixels, each containing part of the detection-box information. The centre coordinates of the image are denoted $(X_c, Y_c)$, and the width and height of the detection box are denoted $W$ and $H$, respectively. Based on the corner coordinates $(x_1, y_1)$ and $(x_2, y_2)$ of the detection box, the coordinates falling within each segmentation region can be determined. The differences between these coordinates and the image centre coordinates then give the area of the detection box contained in each segmentation region. The obtained area is compared with $\bar{S}$, and segmentation regions whose detection-box area is smaller than $\bar{S}$ are removed.
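The screening rule can be written as the intersection area between the detection box and each grid cell, compared against a threshold; a minimal sketch under the same coordinate conventions as above is shown here (the `threshold` argument stands for the average-area comparison described in the text).

```python
def box_tile_overlap(box, tile_rect):
    """Pixel area of the part of the detection box falling inside one tile.
    Both arguments are (x1, y1, x2, y2) rectangles."""
    bx1, by1, bx2, by2 = box
    tx1, ty1, tx2, ty2 = tile_rect
    iw = max(0, min(bx2, tx2) - max(bx1, tx1))
    ih = max(0, min(by2, ty2) - max(by1, ty1))
    return iw * ih

def screen_tiles(box, n, img_size, threshold):
    """Keep only the grid cells whose share of the detection-box area is at
    least `threshold` (the text compares against the average area S_bar)."""
    step = img_size // n
    kept = []
    for r in range(n):
        for c in range(n):
            rect = (c * step, r * step, (c + 1) * step, (r + 1) * step)
            if box_tile_overlap(box, rect) >= threshold:
                kept.append((r, c))
    return kept
```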
After the first two steps, a large number of region images of 512 × 512 pixels or 341 × 341 pixels are generated. For the images obtained by 2 × 2 segmentation, four regions are randomly selected and combined; for the images obtained by 3 × 3 segmentation, nine regions are randomly selected and combined; the results are shown in Figure 5. Randomly combining these regions effectively increases the complexity and diversity of the image samples and, at the same time, enhances the model’s ability to adapt to complex scenes.
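To make the recombination step concrete, a minimal sketch is given below; `tile_pool` is assumed to be the list of screened, same-sized region images produced by the previous step (with at least n × n entries), and the names are illustrative rather than the authors' code.

```python
import random
import numpy as np

def random_mosaic(tile_pool, n):
    """Randomly draw n*n same-sized tiles from the screened pool and
    stitch them into one composite augmentation image."""
    chosen = random.sample(tile_pool, n * n)
    rows = [np.hstack(chosen[r * n:(r + 1) * n]) for r in range(n)]
    return np.vstack(rows)

# Hypothetical usage: build one 2 x 2 composite from screened 512-pixel tiles.
# new_image = random_mosaic(tiles_512, n=2)
```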
2.2.3. Image Data Enhancement by Fusing Multiple Supervision Methods
Image data augmentation is a method of extending a dataset by generating similar data, which increases the amount of training data, improves the generalisation of the model, and helps to address the problem of sample imbalance. Enhancement methods include supervised data enhancement techniques as well as unsupervised data enhancement techniques. In this study, supervised and unsupervised image data enhancement methods are combined to perform balanced augmentation for tomato flower images to improve the balance of the dataset.
Supervised Image Data Enhancement
Supervised image data enhancement augments image data from existing images using predefined transformation rules. In this study, three supervised image data augmentation techniques were used to process the images: angle transformation, brightness transformation, and noise addition.
- (1)
Angle transformation. To simulate more viewpoints, one of four operations, rotation by 90, 180, or 270 degrees or mirroring, is randomly chosen to process the original image. Angle transformation of the training set images enhances the robustness of the trained model to different imaging angles.
- (2)
Brightness transformation. Lighting conditions at capture time make some regions of the images overly bright. Therefore, the brightness of the training set images is transformed: the transformed brightness value $I'$ equals the product of a coefficient $k$ and the original brightness value $I$, i.e., $I' = kI$, where $k$ is randomly selected from the range 0.5–1.0.
- (3)
Noise addition. Adding a moderate amount of random noise to the original image, most commonly Gaussian noise, can effectively prevent overfitting and enhance the learning ability of the model, allowing it to adapt better to data variations during training and improving its generalisation ability. A combined sketch of these three operations is given after this list.
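The following is a minimal sketch of the three supervised operations using OpenCV and NumPy (illustrative only; bounding-box labels must be transformed consistently with the geometric operations, which is omitted here, and the Gaussian noise standard deviation of 5 is an assumed value, not taken from the text).

```python
import random
import numpy as np
import cv2

def supervised_augment(image):
    """One randomly parameterised pass of the three supervised
    transformations described above (illustrative sketch)."""
    # (1) Angle transformation: rotate by 90/180/270 degrees or mirror.
    op = random.choice(["rot90", "rot180", "rot270", "mirror"])
    if op == "rot90":
        image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    elif op == "rot180":
        image = cv2.rotate(image, cv2.ROTATE_180)
    elif op == "rot270":
        image = cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE)
    else:
        image = cv2.flip(image, 1)  # horizontal mirror

    # (2) Brightness transformation: I' = k * I with k drawn from [0.5, 1.0].
    k = random.uniform(0.5, 1.0)
    image = np.clip(image.astype(np.float32) * k, 0, 255).astype(np.uint8)

    # (3) Gaussian noise with an assumed small standard deviation.
    noise = np.random.normal(0.0, 5.0, image.shape)
    image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return image
```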
Unsupervised Image Data Enhancement
A GAN is a generative deep learning model that produces data through the adversarial training of a generator (G) and a discriminator (D) [25], as shown in Figure 6. The generator receives random noise and converts it into realistic fake samples with the aim of deceiving the discriminator. The discriminator, in turn, is responsible for distinguishing between real and generated data, with the aim of accurately identifying their sources. The two compete during training and the generator eventually produces high-quality synthetic samples.
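To make the adversarial objective concrete, a generic minimal PyTorch sketch of one GAN training step (standard binary cross-entropy formulation) is given below; G and D are assumed to be nn.Module instances whose discriminator returns one logit per sample, and device handling is omitted.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_step(G, D, opt_G, opt_D, real, z):
    """One adversarial update: D learns to separate real from generated
    samples, G learns to make D label its samples as real."""
    # --- Discriminator update ---
    opt_D.zero_grad()
    fake = G(z).detach()
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake), torch.zeros(real.size(0), 1))
    d_loss.backward()
    opt_D.step()

    # --- Generator update ---
    opt_G.zero_grad()
    fake = G(z)
    g_loss = bce(D(fake), torch.ones(real.size(0), 1))
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```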
Images acquired with visible light cameras or mobile phones during the growth of tomato plants usually have complex backgrounds, and tomato flowers occupy only a small percentage of the image, which increases the difficulty of image processing and analysis. Although supervised image data augmentation can expand the data volume in different ways, it can also introduce problems such as missing feature information and low image quality. Therefore, in this study, image data enhancement is also performed with the StarGANv2 network, which is designed to transform images from one domain to another and can provide diverse images [26]. The structure of StarGANv2 is shown in Figure 6.
The StarGANv2 model includes a mapping network and a style encoder in addition to a generator and a discriminator. The mapping network is a fully connected network that maps latent vectors to style codes in different domains. The style encoder is a multi-task style feature extraction network that provides diverse style features from reference images. The generator receives the original image together with the style code produced by the mapping network or the style encoder and generates the target image. The discriminator is designed as a multi-branch classifier that performs binary classification on each branch. Using the trained StarGANv2 model, the original facility tomato images can be translated to generate augmented images with different backgrounds. Within the same growth stage, multiple images can thus be generated from a single image through different backgrounds, achieving balanced data augmentation.
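A hedged sketch of how a trained StarGANv2 could be applied here for balanced augmentation is given below: a style code is drawn from the mapping network for a chosen target domain, and the generator translates the source tomato image under that style. The module interfaces are assumptions following the components described above, not the authors' released code.

```python
import torch

@torch.no_grad()
def stargan_augment(generator, mapping_net, src_image, domain,
                    n_variants=3, latent_dim=16):
    """Produce several style variants of one source image by sampling a new
    latent code per variant for the same target domain.  `generator` and
    `mapping_net` are assumed to be trained StarGANv2 modules with the
    interfaces described in the text: mapping_net(z, domain) -> style code,
    generator(image, style) -> translated image."""
    variants = []
    for _ in range(n_variants):
        z = torch.randn(1, latent_dim)        # random latent vector
        style = mapping_net(z, domain)        # style code for the target domain
        variants.append(generator(src_image, style))
    return variants
```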
2.2.4. Target Detection and Evaluation Indicators
In this study, the YOLOv7 target detection algorithm is used to test the effect of balanced image augmentation. YOLOv7 is fast, accurate, and easy to train and deploy. In the range of 5–160 FPS it surpasses previously known target detectors in both speed and accuracy, is reported to be about 120% faster than YOLOv5, and outperforms YOLOv5 on the MS COCO dataset.
Figure 7 shows the network structure of YOLOv7, which consists of three parts: the backbone, the neck, and the head. Input images are first resized to 640 × 640 × 3 and fed into a backbone composed of CBS composite modules, efficient layer aggregation network (ELAN) modules, and MP modules. The ELAN module enhances the model’s feature learning capability and robustness by optimising gradient paths, whereas the MP module combines pooling and convolutional downsampling. In the neck, YOLOv7 uses the ELAN-W structure for multi-scale feature fusion and obtains different receptive fields through max pooling in the SPPCSPC module to adapt to images of different resolutions.
In this paper, a comprehensive set of metrics is used to evaluate the performance of tomato flower target detection: precision (P), recall (R), average precision (AP), mean average precision (mAP), and F1 score. Precision is defined as the number of correctly detected targets divided by the total number of detected targets; the higher the precision, the better the detection. However, relying solely on precision may not fully reflect the detection performance, so recall, mAP, and F1 score are introduced for a comprehensive evaluation. The metrics are calculated as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times P \times R}{P + R}$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$

where TP (true positive) denotes the number of correctly detected tomato flowers, FP (false positive) denotes the number of other objects incorrectly detected as tomato flowers, FN (false negative) denotes the number of undetected or missed tomato flowers, $AP_i$ is the area under the precision–recall curve of class $i$, and $n$ is the number of classes (here $n = 1$, so mAP equals the AP of the flower class).
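A minimal sketch of how precision, recall, and F1 score follow from these counts is given below (AP and mAP additionally require integrating precision over recall at the chosen IoU threshold, which is omitted); the example counts are hypothetical.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: 81 correct detections, 17 false detections, 19 misses.
print(detection_metrics(81, 17, 19))  # -> (0.8265..., 0.81, 0.8182...)
```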
3. Results
3.1. Experimental Setup
This study used the PyTorch deep learning framework with Python 3.8 as the main platform for training and testing, on a desktop computer running Windows 11 with an Intel Core i9-12900K CPU. Considering GPU computing power, an NVIDIA GeForce RTX 3090 Ti graphics card was used, combined with CUDA 11.0, cuDNN, and OpenCV to achieve efficient image processing.
In the training phase, the YOLOv7 network input image size was fixed at 640 × 640 pixels. The model was trained for a total of 200 epochs with a batch size of 32, and batch normalisation (BN) layers provided regularisation as the model weights were updated. Stochastic gradient descent (SGD) was used as the optimiser with a momentum factor of 0.937. The initial learning rate was set to 0.01, and the weight decay coefficient was 0.0005 to control the complexity of the model and prevent overfitting.
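As a sketch of the optimiser settings listed above (a generic PyTorch configuration rather than the YOLOv7 training script itself), with a placeholder module standing in for the detection network:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder; in practice this is the YOLOv7 network

EPOCHS, BATCH_SIZE, IMG_SIZE = 200, 32, 640  # values reported in the text
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # initial learning rate
    momentum=0.937,       # momentum factor
    weight_decay=0.0005,  # weight decay coefficient
)
```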
3.2. Analysis of Detection Results for Different Datasets
The mAP@0.5 is the AP value of the tomato flower detector when the IoU threshold is set to 0.5. The precision and mAP@0.5 results for the different datasets are shown in Figure 8. It can be seen that the visible camera dataset (Camera), acquired by fixed-point in situ shooting, fluctuates considerably even though its accuracy continues to increase; compared with the other two datasets, its accuracy increases the most slowly and it converges the slowest. The mobile phone dataset (Phone), acquired by random handheld shooting, gives more stable training results with less fluctuation, peaking after about 30 training rounds. The mixed dataset (Mix) converges fastest during training, with a gradual increase in accuracy and the least fluctuation. After mixing the images, the image diversity becomes richer, the complexity increases, and the model is able to learn more features.
Table 1 shows the training metrics for the different datasets. The mixed dataset has the highest precision, 82.81%, which is 12.48 and 1.73 percentage points higher than the results of the other two datasets, respectively. Its recall is 81.25%, an improvement of 12.1 and 1.03 percentage points, respectively. Expanding the image data source and fusing image data obtained in different ways can therefore effectively improve the adaptability of the model to different environments, optimise the training effect, and enhance the robustness of the model.
Figure 9 shows the detection results of models trained on the different datasets applied to different images. From the results in Figure 9a,b, it can be seen that the model misses many detections and the detection effect is unsatisfactory; its ability to adapt to different environments and conditions is limited and its generalisation ability is weak. The detection results of the model trained on the mixed dataset are shown in Figure 9c,d: it maintains a high level of detection when facing interference such as bright light, dim light, or occlusion, and its detection of small targets is also more accurate. This model demonstrates strong adaptability and excellent generalisation, which shows that, when the dataset is insufficient, expanding the data source can effectively improve both the model’s metrics and its generalisation ability.
3.3. Detection Performance Analysis of Different Datasets Combined with Amplification Methods
The training results of the visible camera dataset after balanced augmentation are shown in Figure 10a, where initial-processing denotes the supervised image data enhancement method and deep-processing denotes the unsupervised (StarGANv2-based) enhancement method. It can be seen that the initial-processing method converges faster and its results change more stably. Zooming in on the accuracy changes from rounds 40 to 150 shows that the training accuracy of the deep-processing method fluctuates more, with the accuracy difference reaching 0.38. Table 2a shows the results of each metric. The initial-processing method gives the best training results: the detection precision increased from 70.33% to 77.29%, an increase of 6.96 percentage points, and the recall and mAP@0.5 improved by 2.87 and 0.87 percentage points, respectively. The increase in precision implies that the model performs better in reducing false detections. On the other hand, some small targets in the dataset remain difficult to detect, so the improvement in recall and mAP@0.5 is limited. The training effect after processing the images with the StarGANv2 network is unsatisfactory, which may be related to the data collection method: images taken at a fixed point introduce more interfering information, and most images in this dataset mainly contain plant leaves, with tomato flowers occupying only a very small proportion. In addition, the background information of the images is almost identical, so the images generated by the background transformation of the StarGANv2 network change little, and the target is even less clear than in the original images, resulting in an overall training effect worse than that of the original images.
Figure 10b shows the variation of the results after balanced augmentation of the mobile phone dataset. Zooming in on some of the training rounds shows that the detection accuracy on the source images is higher with less fluctuation, and that the accuracy of initial-processing changes more stably than that of deep-processing. This may be related to the image information equalisation processing: unlike fixed-point shooting, mobile phone shooting focuses more on the target and usually places it in the centre of the image, so segmenting the image in different ways may split the target information across regions and leave it incomplete. In that case, LabelImg cannot label the complete tomato flower information, the model cannot extract all the features, and the training effect ultimately decreases. After StarGANv2 augmentation, the model metrics are slightly improved, as shown in Figure 10b; the images generated by the StarGANv2 network do not destroy the integrity of the original image information while retaining the key target information, so the precision and recall of detection are improved. Table 2b shows that the precision of the deep-processing method is 1.29 percentage points higher and its recall 4.09 percentage points higher than those of the initial-processing method. Compared with the training results on the source dataset, however, the precision is 2.88 percentage points lower and the recall 4.07 percentage points lower. This is because the images generated by the StarGANv2 network cannot exactly match real-world images: they are not as rich as real images in edges, details, and textures, and although the background of the source images is complex, it changes little overall. This leads to a single form of background transformation and makes it difficult to cover the variety of complex background features found in the real world.
In summary, different balanced augmentation methods affect the training results differently depending on the image acquisition method. For the visible light camera dataset, the supervised balanced augmentation method gives the best training effect: it accelerates convergence, stabilises the results, and significantly improves the detection precision, recall, and mAP@0.5. For the mobile phone dataset, it is preferable to train on the source images, because mobile phone photography is more focused on the target and additional augmentation may introduce noise that degrades model training.
3.4. Evaluation of Mixed Data Performance under Different Data Sources Matching the Optimal Amplification Method
The visible light camera dataset after supervised balanced augmentation (initial-processing) and the mobile phone dataset (Phone) were remixed to form a new dataset (New mix) for comparative analysis with the original mixed dataset (Mix). Figure 11 shows the training results, where Figure 11a shows the loss curves and Figure 11b the accuracy curves. As shown in Figure 11a, all loss curves converge and no overfitting occurred during training. The loss decreases rapidly in the first 50 rounds and then declines gradually between rounds 50 and 200; ultimately, the loss of the New mix dataset is lower than that of the Mix dataset. From Figure 11b, it can be seen that the accuracy of the New mix dataset improves relatively quickly, reaching its maximum earlier and then entering a stable stage with relatively small fluctuations.
Table 3 shows the training metrics for the two types of mixed datasets. The precision, recall, mAP@0.5, and F1 score of New mix are all better than those of the Mix dataset, with improvements of 0.78, 0.2, 1.2, and 0.49 percentage points, respectively. Compared with the IC dataset, the precision has increased by 13.26 percentage points and the recall by 12.3 percentage points.
The problem of plant stems or leaves occluding flowers is common in images collected in the greenhouse, so the trained model must achieve as high a detection accuracy as possible under occlusion. To verify the performance of the models under different conditions, images under dim, bright, and occluded conditions were selected for testing; the results are shown in Figure 12.
The detection results in Figure 12 show that the models trained on both mixed datasets perform well in the complex plant environment. Comparing Figure 12a and Figure 12d reveals that some targets are not detected because of blurring and occlusion; these are marked with red circles in Figure 12. In contrast, the model trained on New mix performs better under occlusion and achieves more accurate detection.
Although it performs well under occlusion, the model trained on New mix shows over-detection, as shown in Figure 13d–f: it marks the same target multiple times. Compared with Figure 13a–c, the model trained on Mix produces more accurate boxes with no duplicate markings, which explains the different detection counts in Table 3. Over-detection leads to redundant results; too many duplicate markings not only increase the consumption of computational resources but may also affect subsequent processing and analysis, increasing the false alarm rate. Therefore, although the New mix model performs well under occlusion, it still needs to be optimised to address the over-detection problem.
In summary, the model trained on New mix is able to identify and locate occluded plant parts more accurately, showing its advantage in complex scenes. This detection capability gives the new model a clear advantage in specific application scenarios, providing more accurate information for downstream analysis.
4. Discussion
The geometric position-based image salient feature information equalisation and amplification method proposed in this study has achieved significant results in solving the problem of imbalanced sample size in tomato flower images. Especially after integrating visible light and mobile phone datasets (IC + MP), the model accuracy and recall have been improved, demonstrating the advantages of diverse data sources in model training. However, although the introduction of mobile phone images has increased the sample size, the characteristics of their specific target shooting may lead to insufficient representativeness of the data, which in turn affects the performance of the model in real environments. Therefore, the detection performance of the model still needs to be systematically validated in diverse and real agricultural environments. In addition, although the introduction of generative adversarial networks (GANs) aims to enhance the model’s generalisation ability through data augmentation, their effectiveness in this study is not ideal. This is mainly due to the small proportion of tomato flower target information in the captured images, resulting in a single image style and limiting the effectiveness of GAN in feature learning, thereby affecting the quality of the generated images and directly leading to the failure to significantly enhance model performance.
5. Conclusions and Future Work
In this study, we propose a balanced augmentation method for salient feature information in images based on geometric position, to address the imbalance in the number of image samples across the growth stages of tomato. On this basis, the effects of different data sources and augmentation methods on the performance of the detection model are analysed. From the perspective of the image datasets, the hybrid dataset (IC + MP), fusing the visible light camera and mobile phone image datasets, gives the best training results, with a precision of 82.81% and a recall of 81.25%, about 12 percentage points higher than those of the in situ monitoring dataset alone. Therefore, on the basis of in situ sensor monitoring, expanding the data source with occasional random photographs can significantly improve the detection ability of the model. In terms of the image augmentation method, after supervised balanced augmentation of the visible light dataset, the training precision of the model improves from 70.33% to 77.29% and the recall from 69.15% to 72.02%; by contrast, for the mobile phone dataset, training on the original images outperforms all augmentation methods. Finally, the hybrid dataset formed by fusing the supervised balanced-augmented visible light dataset with the mobile phone dataset gives the best overall training effect: its loss curve converges faster, and the trained model reaches a detection precision of 83.59% and a recall of 81.45%. By introducing a more diverse sample set and a better training strategy, the New mix dataset significantly enhances the recognition of occluded targets and the detection of small targets, and effectively reduces the missed and false detections common in traditional methods. In addition, the improved model is more stable in complex scenes, which further enhances its reliability and practicality in real applications.
The various factors in the real environment, such as lighting intensity, leaf occlusion, and background complexity, may have a significant impact on image quality, leading to instability in model performance. The combined effect of these factors may result in significant differences in model performance under different environmental conditions. In the future, this study should focus on improving the diversity and representativeness of the dataset, especially by collecting more samples under different environmental conditions to enhance the model’s generalisation ability. Meanwhile, by improving the GAN network architecture and training strategy, the quality of generated images can be enhanced, thereby improving the overall performance of the model.
Author Contributions
Writing—original draft, P.L.; conceptualisation, methodology, L.Z. and X.L.; data analysis and collection, J.X. and Y.L.; assembly of equipment, S.Z.; project management, funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.
Funding
National Natural Science Foundation of China (52379035), Excellent Youth Science Foundation of BAAFS (YXQN202304), Cultivation of major scientific and technological achievements of BAAFS, Beijing Nova Program (20230484375).
Institutional Review Board Statement
This study did not require ethical approval.
Informed Consent Statement
The authors declare no potential conflicts of interest or ethical problems related to the data used.
Data Availability Statement
The datasets generated and analysed for this study are available by contacting the corresponding authors. The data are securely stored in our institutional repository and will be shared subject to ethical guidelines and data sharing policies.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Mavridou, E.; Vrochidou, E.; Papakostas, G.A.; Pachidis, T.; Kaburlasos, V.G. Machine Vision Systems in Precision Agriculture for Crop Farming. J. Imaging 2019, 5, 89. [Google Scholar] [CrossRef] [PubMed]
- Lu, Y.; Young, S. A survey of public datasets for computer vision tasks in precision agriculture. Comput. Electron. Agric. 2020, 178, 105760. [Google Scholar] [CrossRef]
- Liakos, K.; Busato, P.; Moshou, D.; Pearson, S.; Bochtis, D. Machine Learning in Agriculture: A Review. Sensors 2018, 18, 2674. [Google Scholar] [CrossRef] [PubMed]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Available online: http://arxiv.org/abs/1912.01703 (accessed on 5 July 2024).
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), Savannah, GA, USA, 2–4 November 2016. [Google Scholar]
- Li, J.; Wang, E.; Qiao, J.; Li, Y.; Li, L.; Yao, J.; Liao, G. Automatic rape flower cluster counting method based on low-cost labelling and UAV-RGB images. Plant Methods 2023, 19, 40. [Google Scholar] [CrossRef]
- Xu, C.; Lu, Y.; Jiang, H.; Liu, S.; Ma, Y.; Zhao, T. Counting Crowded Soybean Pods Based on Deformable Attention Recursive Feature Pyramid. Agronomy 2023, 13, 1507. [Google Scholar] [CrossRef]
- Jiang, Y.; Li, C. Convolutional Neural Networks for Image-Based High-Throughput Plant Phenotyping: A Review. Plant Phenomics 2020, 2020, 4152816. [Google Scholar] [CrossRef]
- Li, T.; Sun, M.; He, Q.; Zhang, G.; Shi, G.; Ding, X.; Lin, S. Tomato recognition and location algorithm based on improved YOLOv5. Comput. Electron. Agric. 2023, 208, 107759. [Google Scholar] [CrossRef]
- Milioto, A.; Lottes, P.; Stachniss, C. Real-time semantic segmentation of crop and weed for precision agriculture robots leveraging background knowledge in CNNs. IEEE Access 2018, 22, 2229–2235. [Google Scholar]
- Wong, S.C.; Gatt, A.; Stamatescu, V.; McDonnell, M.D. Understanding Data Augmentation for Classification: When to Warp? Available online: http://arxiv.org/abs/1609.08764 (accessed on 5 July 2024).
- Wang, Y.; Huang, G.; Song, S.; Pan, X.; Xia, Y.; Wu, C. Regularizing Deep Networks with Semantic Data Augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3733–3748. [Google Scholar] [CrossRef]
- Zhang, J.; Rao, Y.; Man, C.; Jiang, Z.; Li, S. Identification of cucumber leaf diseases using deep learning and small sample size for agricultural Internet of Things. Int. J. Distrib. Sens. Netw. 2021, 17, 155014772110074. [Google Scholar] [CrossRef]
- Lu, Y.; Chen, D.; Olaniyi, E.; Huang, Y. Generative adversarial networks (GANs) for image augmentation in agriculture: A systematic review. Comput. Electron. Agric. 2022, 200, 107208. [Google Scholar] [CrossRef]
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, H.; Zhang, M.; Chen, Y. Reconstruction of kiwifruit fruit geometry using a CGAN trained on a synthetic dataset. Comput. Electron. Agric. 2020, 175, 105590. [Google Scholar] [CrossRef]
- Wang, H.; Zhang, Z.; Liu, Y.; Wang, L. High-resolution wheat images synthesis using generative adversarial networks for classification and detection. Comput. Electron. Agric. 2018, 154, 67–75. [Google Scholar] [CrossRef]
- Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. Robust Tomato Recognition for Robotic Harvesting Using Feature Images Fusion. Sensors 2016, 16, 173. [Google Scholar] [CrossRef]
- Rupanagudi, S.R.; Ranjani, B.S.; Nagaraj, P.; Bhat, V.G. A cost effective tomato maturity grading system using image processing for farmers. In Proceedings of the 2014 International Conference on Contemporary Computing and Informatics (IC3I), Mysore, India, 27–29 November 2014; pp. 7–12. [Google Scholar] [CrossRef]
- Yamamoto, K.; Guo, W.; Ninomiya, S. Node Detection and Internode Length Estimation of Tomato Seedlings Based on Image Analysis and Machine Learning. Sensors 2016, 16, 1044. [Google Scholar] [CrossRef]
- Liu, G.; Mao, S.; Kim, J.H. A Mature-Tomato Detection Algorithm Using Machine Learning and Color Analysis. Sensors 2019, 19, 2023. [Google Scholar] [CrossRef]
- Rong, J.; Zhou, H.; Zhang, F.; Yuan, T.; Wang, P. Tomato cluster detection and counting using improved YOLOv5 based on RGB-D fusion. Comput. Electron. Agric. 2023, 207, 107741. [Google Scholar] [CrossRef]
- Kang, R.; Huang, J.; Zhou, X.; Ren, N.; Sun, S. Toward Real Scenery: A Lightweight Tomato Growth Inspection Algorithm for Leaf Disease Detection and Fruit Counting. Plant Phenomics 2024, 6, 0174. [Google Scholar] [CrossRef]
- Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65. [Google Scholar] [CrossRef]
- Yaermaimaiti, Y.; Wang, R. Chinese character style transfer based on improved StarGAN v2 network. Int. J. Inf. Commun. Technol. 2024, 1, 10063507. [Google Scholar] [CrossRef]
Figure 1.
Tomato flower samples captured by different shooting methods: (a–c) are examples of images captured by visible light cameras; (d–f) are examples of capturing images with a mobile phone.
Figure 2.
Dataset preparation and methodological process.
Figure 3.
Geometric location-based equalisation method for salient feature information.
Figure 4.
Geometric location-based image screening for salient features.
Figure 5.
Geometric location-based salient feature information equalisation results.
Figure 6.
StarGANv2 model structure.
Figure 7.
YOLOv7 model structure.
Figure 8.
The results of different indicators of different dataset changes: (a) is the change of detection accuracy; (b) is a mAP@0.5 change.
Figure 9.
Detection results of models trained on different datasets on different images: (a) is the result of the model trained on the visible camera dataset on the mobile phone captured image; (b) is the result of the model trained on the mobile phone dataset on the visible camera captured image; and (c,d) are the detection results of the model trained on the mixed dataset.
Figure 10.
Training results after different balanced amplification methods for different datasets: (a) training results for visible camera dataset; (b) training results for mobile phone dataset.
Figure 11.
Training situation of Mix dataset and New mix dataset. (a) is the change in training loss, (b) is the change in training accuracy.
Figure 12.
Actual detection results of tomato flowers: (a–d) are the Mix training model detection results; (e–h) are the New mix training model detection results.
Figure 13.
Comparison of the results of repeated detection of tomato flowers: (a–c) are the results of the Mix training model’s detection; (d–f) are the results of the New mix training model’s detection.
Table 1.
Performance of trained models for different datasets.
Data | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 Score (%)
---|---|---|---|---
Insitu-Camera (IC) | 70.33 | 69.15 | 73.04 | 69.73
Mobile-Phone (MP) | 81.08 | 80.22 | 80.63 | 80.68
Mix (IC + MP) | 82.81 | 81.25 | 81.26 | 82.02
Table 2.
(a) Performance of trained model after balanced augmentation of visible camera dataset (IC). (b) Performance of trained model after balanced augmentation of mobile phone dataset (MP).
(a)

Amplification | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 Score (%)
---|---|---|---|---
Origin | 70.33 | 69.15 | 73.04 | 69.73
Initial-Processing (IP) | 77.29 | 72.02 | 73.91 | 76.48
Deep-Processing (DP) | 65.63 | 52.73 | 61.62 | 58.48

(b)

Amplification | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 Score (%)
---|---|---|---|---
Origin | 81.08 | 80.22 | 80.63 | 80.68
Initial-Processing (IP) | 76.91 | 72.06 | 74.13 | 74.41
Deep-Processing (DP) | 78.20 | 76.15 | 74.08 | 77.16
Table 3.
Performance of trained models for two types of mixed datasets.
Data | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 Score (%) | Number of Detections
---|---|---|---|---|---
Mix (IC + MP) | 82.81 | 81.25 | 81.26 | 82.02 | 133
New mix (IC ∗ IP + MP) | 83.59 | 81.45 | 82.46 | 82.51 | 141