1. Introduction
Meeting the demands of a growing global population requires a substantial increase in food production. Plant breeding is therefore essential to ensure food, economic, and environmental sustainability, as well as to contribute to human health and well-being. Achieving this increase in production, however, is a multifaceted endeavor, impeded by the depletion of natural resources, the limited availability of arable land, and considerable variability in climate conditions, among other challenges. Consequently, innovative solutions, exemplified by the genomic selection (GS) methodology introduced by Meuwissen et al. [1], have become essential for genetic improvement. These advances aim to stabilize yields, raise productivity, increase disease resistance, and enhance the nutritional profiles and end-use quality of pivotal crops such as wheat, rice, maize, and others [2].
Genomic selection (GS) and genomic prediction (GP) represent a paradigm shift in plant breeding [3]. Nevertheless, the practical execution of GS remains a formidable task, as it does not consistently deliver highly accurate predictions [4]. High prediction accuracy is essential for the successful implementation of genomic prediction in plant breeding for several crucial reasons: (1) Efficient selection. Accurate predictions enable breeders to efficiently identify and select individuals with desired traits, such as high yield, disease resistance, or nutritional quality, streamlining the breeding process by reducing the need to grow and evaluate large numbers of plants. (2) Resource optimization. High-accuracy predictions help allocate limited resources such as land, labor, and financial investments more effectively, allowing breeders to focus their efforts on plants with the greatest potential for improvement. (3) Faster progress. When predictions are highly accurate, progress in breeding programs is accelerated, allowing improved crop varieties to be developed in a shorter timeframe, which is crucial to address food security and agricultural challenges [5]. (4) Cost reduction. Accurate predictions reduce the costs associated with field trials, extensive phenotyping, and maintaining large breeding populations, making breeding programs more economically viable. (5) Genetic gain. Higher prediction accuracies lead to greater genetic gains, meaning that desirable traits are incorporated into the breeding population more rapidly and effectively [5], resulting in crops with improved characteristics. (6) Stability. Stable and reliable predictions minimize the risk of selecting plants with undesirable traits, which could set back breeding programs or lead to inferior varieties [5]. (7) Confidence. High-accuracy predictions give breeders confidence in their selections, increasing the likelihood of success and the adoption of new varieties by farmers [6].
Reaching high prediction accuracy with GS is challenging due to genetic complexity, environmental variation, and limitations in data and resources. Complex traits often involve many genes, while environmental factors affect trait expression [7,8,9,10,11]. Accurate phenotyping and marker data are crucial, and overfitting and population structure can reduce accuracy. Ongoing research aims to improve models, marker densities, and data quality to enhance the precision of genomic predictions [7,8,9,10,11]. For this reason, novel approaches are required to improve the prediction accuracy of GS.
In general, deep neural networks are trained to minimize the error on the training dataset (empirical risk minimization), and the size of these networks tends to scale with the size of the training set [12]. Classical learning theory, however, guarantees good generalization under empirical risk minimization only when the capacity of the model does not grow with the amount of training data. Moreover, Zhang et al. [12] pointed out that networks trained in this way can memorize the training data and change their predictions drastically when evaluated on examples drawn from a distribution that differs only slightly from the training distribution. In practical terms, minimizing the error on the training set alone does not guarantee good performance when the test data differ, even slightly, from the training distribution.
Zhang et al. [12] argued that one way to address this limitation is data augmentation (DA) based on the vicinal risk minimization principle, in which virtual training examples are drawn from a vicinity (neighborhood) distribution defined around each observed training example. That is, to enrich the training set in GS, a vicinity distribution of the training set is required. However, Zhang et al. [12] also pointed out that, although conventional DA improves generalization, it is highly data-dependent and does not model the vicinity relationships among examples from different possible training populations. Based on this, Zhang et al. [12] presented a DA routine named mixup that constructs virtual training examples. In supervised learning, the goal is to find a function that describes the relationship between a random feature vector (X) and a random target vector (Y), which jointly follow a probability distribution P(X,Y). mixup constructs virtual examples as convex combinations of pairs of training examples and their targets, and training on these examples regularizes the neural network to favor simple, approximately linear behavior in between training examples.
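To make the mixup idea concrete, the following minimal sketch (written in Python with NumPy for illustration; the array names, the number of virtual examples, and the Beta parameter are assumptions rather than the exact configuration used in this study) shows how virtual marker–phenotype pairs can be generated as convex combinations of randomly paired training examples.

```python
import numpy as np

def mixup_augment(X, y, n_new, alpha=0.4, seed=0):
    """Generate n_new virtual (marker, phenotype) pairs via mixup.

    X : (n, p) marker matrix; y : (n,) phenotype vector.
    Each virtual example is a convex combination of two randomly chosen
    training examples, with the mixing weight drawn from a Beta(alpha, alpha)
    distribution (alpha = 0.4 is an illustrative choice).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, size=n_new)           # first parent of each pair
    j = rng.integers(0, n, size=n_new)           # second parent of each pair
    lam = rng.beta(alpha, alpha, size=n_new)     # mixing weights in (0, 1)
    X_new = lam[:, None] * X[i] + (1 - lam[:, None]) * X[j]
    y_new = lam * y[i] + (1 - lam) * y[j]
    return X_new, y_new

# Example usage with simulated data (hypothetical dimensions)
X = np.random.choice([0, 1, 2], size=(100, 500)).astype(float)  # SNP codes
y = np.random.normal(size=100)                                   # phenotypes
X_aug, y_aug = mixup_augment(X, y, n_new=200)
X_train = np.vstack([X, X_aug])
y_train = np.concatenate([y, y_aug])
```

Because each virtual example lies on the line segment between two observed examples, training on the augmented set encourages the fitted model to behave approximately linearly between observations, which is the regularizing effect described above.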
Data augmentation (DA) is a technique that artificially enlarges the training set to improve prediction performance. The training set is expanded by applying various transformations to the existing data, such as rotations, flips, or cropping. This enhances model robustness, generalization, and performance by exposing the model to a wider range of training examples, improving its ability to handle real-world variation and noise [13,14,15,16]. Some successful applications of data augmentation are the following: (1) Image classification. DA is widely used in image classification tasks, such as recognizing objects or animals, by creating variations of images with different angles, lighting, and perspectives [17]. (2) Natural language processing (NLP). In NLP, DA techniques such as synonym replacement, paraphrasing, and text generation are applied to expand text datasets, improving the performance of models in tasks such as sentiment analysis and text summarization [18]. (3) Speech recognition. DA is used in speech recognition by altering audio samples with noise, speed variation, or pitch shifts, making models more robust to different speaking styles and environments [19]. (4) Tabular data. In the context of tabular data, DA involves generating synthetic data points by slightly perturbing or interpolating existing data entries [13]; for example, in financial fraud detection, a dataset of credit card transactions can be augmented by creating new instances with slightly modified transaction amounts or timestamps, helping a model detect fraudulent activities more effectively (a minimal sketch of this kind of perturbation follows this paragraph).
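As a concrete illustration of item (4), the short sketch below (Python/NumPy; the noise scale and the example values are hypothetical) creates additional synthetic rows for a numeric table by adding small Gaussian perturbations to the existing entries.

```python
import numpy as np

def jitter_rows(X, n_copies=1, noise_scale=0.05, seed=0):
    """Create synthetic rows by adding small Gaussian noise to numeric columns.

    X : (n, p) array of numeric features.
    noise_scale : noise standard deviation, expressed as a fraction of each
                  column's standard deviation (an illustrative choice).
    """
    rng = np.random.default_rng(seed)
    col_sd = X.std(axis=0, keepdims=True)
    synthetic = [X + rng.normal(0.0, noise_scale * col_sd, size=X.shape)
                 for _ in range(n_copies)]
    return np.vstack(synthetic)

# Example: perturb a small table of transaction amounts and times of day
X = np.array([[120.5, 13.2], [89.9, 9.7], [240.0, 21.4]])
X_aug = jitter_rows(X, n_copies=2)   # six synthetic rows near the originals
```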
Regarding the average gain in prediction performance, the effect of using DA compared to not using it depends on the specific dataset, task, and augmentation techniques employed. In general, DA can lead to notable improvements in prediction performance, particularly when the original dataset is limited or when the task involves recognizing patterns in noisy or diverse data. The improvement can range from a few percentage points to substantial gains, making DA a valuable tool for enhancing the robustness and generalization of machine learning models. However, the extent of the gain depends on factors such as the quality of the augmentation techniques, the complexity of the task, and the size and quality of the initial dataset [20].
Owing to these factors, plant breeding can greatly benefit from the strategic use of DA to enhance the accuracy of genomic prediction models. By augmenting limited datasets with variations and synthetic examples, breeders can inject diversity and representation into their training data. This allows models to capture a broader spectrum of genetic and phenotypic traits, leading to more robust and accurate predictions of plant performance. Furthermore, DA is of paramount importance for mitigating overfitting and reducing the risk of models learning from rare, biased, or unrepresentative samples [13,14,15,16]. In the context of genomic selection, where large-scale genomic data are often limited, DA can be a valuable tool to maximize the utility of existing data. It can empower breeders to make more informed decisions, accelerate the breeding cycle, and ultimately contribute to the development of improved plant varieties with higher yields, better disease resistance, and enhanced adaptability to changing environmental conditions.
The significance of exploring DA in the context of plant breeding cannot be overstated. This research aims to harness the potential of data augmentation to elevate predictive performance, which is crucial to the successful adoption of GS methodologies. The practical implementation of GS remains a formidable challenge, as it does not always guarantee consistently high-quality predictions. Data augmentation techniques offer a pragmatic means of bolstering prediction accuracy: by generating synthetic data points that expand the training dataset, DA introduces vital diversity and enriches the representation of genetic variation. In an era where genomics plays an increasingly pivotal role in modern agriculture and breeding, embracing DA holds the promise of uncovering novel insights, expediting breeding cycles, and fueling advances in crop improvement. Ultimately, this pursuit contributes to the overarching goals of global food security and sustainable agricultural practices.
4. Discussion
The perspective on the use of data augmentation in the realm of machine learning and data analysis has evolved significantly in recent years. Originally seen as a simple technique to artificially increase the size of training datasets, data augmentation has now emerged as a crucial tool to improve model generalization and performance. Rather than just a means of mitigating overfitting, it is increasingly regarded as a strategy to enhance the robustness and adaptability of models. This perspective shift stems from the realization that data augmentation not only introduces diversity into the training data but also enables models to learn more invariant and meaningful features from the augmented samples. As a result, data augmentation is now viewed as an integral component of the deep learning pipeline, playing a pivotal role in improving the real-world applicability and reliability of machine learning models across various domains, from computer vision to natural language processing.
Our results show that the data augmentation strategy worsened the prediction accuracy for the whole testing set by 48.6% and 38.9% in terms of the NRMSE and MAAPE, respectively. It is important to note that, in this study, data augmentation was carried out only on the top 20% of lines in the training set, and the training set used consisted of this top 20% plus the corresponding augmented data. However, when prediction performance is evaluated on the top 20% of the testing set, data augmentation substantially improved the prediction performance for the top lines, by 108.4% in the NRMSE and 107.4% in the MAAPE across traits and datasets.
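For reference, the sketch below shows one common way of computing the two error metrics reported above. The exact normalization used for the NRMSE is not restated here, so normalizing the RMSE by the mean of the observed values is an assumption made for illustration only.

```python
import numpy as np

def nrmse(y_obs, y_pred):
    """Root mean squared error normalized by the mean of the observed values
    (one common normalization; the study's exact choice may differ)."""
    rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    return rmse / np.mean(y_obs)

def maape(y_obs, y_pred):
    """Mean arctangent absolute percentage error: a bounded variant of the
    absolute percentage error that remains finite for near-zero observations."""
    return np.mean(np.arctan(np.abs((y_obs - y_pred) / y_obs)))

# Example usage with hypothetical observed and predicted phenotypes
y_obs = np.array([5.1, 6.3, 4.8, 7.0])
y_pred = np.array([5.4, 6.0, 5.1, 6.5])
print(nrmse(y_obs, y_pred), maape(y_obs, y_pred))
```

Lower values of both metrics indicate smaller prediction errors.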
These results on the use of data augmentation for genomic prediction are promising, demonstrating the potential to revolutionize the field of plant breeding. Genomic prediction relies heavily on the quality and quantity of training data, and data augmentation offers a powerful approach to enhance both aspects. By generating synthetic data points and introducing diversity into the training dataset, data augmentation enables models to capture a wider range of genetic and phenotypic variations, leading to more robust and accurate predictions. Moreover, it mitigates issues such as data scarcity and imbalances, which are common in genomics. For this reason, the expanded and enriched datasets significantly improve the generalization and reliability of genomic prediction models. In an era where precision breeding is essential to address global food security and sustainability challenges, data augmentation, according to our results, is a very promising tool to accelerate progress, drive innovation, and unlock the full potential of genomics in plant breeding.
However, its implementation is challenging: if data augmentation is applied indiscriminately, without a particular goal in mind, it can harm rather than improve the prediction accuracy, as observed when performance was evaluated on the whole testing set. Because our goal was to improve the prediction of the top lines (i.e., the most productive lines), the training dataset consisted of only the top 20% of lines in the training set plus the fully augmented data for this top 20%. Overall, this means that while data augmentation offers immense potential for improving model performance and generalization, it requires careful planning, domain expertise, and quality control to ensure successful implementation without introducing unintended issues or biases. It is also important to emphasize that the data augmentation (DA) approach is applicable not only to the Bayesian GBLUP model but also to various other statistical machine learning models; however, implementing it optimally with other algorithms requires further research. In our study, we exclusively used prediction error metrics, namely the normalized root mean squared error (NRMSE) and the mean arctangent absolute percentage error (MAAPE). Notably, we did not observe improvements in terms of Pearson's correlation coefficient. Consequently, we encourage additional research that employs data augmentation and fine-tunes it more effectively, aiming not only to reduce prediction errors but also to improve Pearson's correlation.
One inherent limitation in our approach lies in the exclusive augmentation of the top lines within each environment during training, focusing solely on these augmented observations for the final training phase. Consequently, our data augmentation strategy disproportionately underscores the importance of these top lines, leading to a reduction in prediction errors specifically for them in the testing set. Despite this targeted improvement, our augmentation strategy falls short of optimizing the overall performance, as it fails to effectively mitigate prediction errors across the entirety of the dataset.
Furthermore, despite significant reductions in prediction errors observed for the top lines in the testing set, no corresponding enhancement was noted in terms of Pearson’s correlation for either the entire testing set or its top lines. Consequently, we advocate for further research employing data augmentation (DA) in the context of genomic prediction. The proposed approach is not deemed optimal, and the question of whether the effectiveness of data augmentation can generalize across various crops, traits, and genetic backgrounds remains unanswered. The inherent variability in genomic data may impact the suitability and results of DA techniques. Additionally, the utilization of DA, particularly within the realm of genomic selection (GS), necessitates thoughtful consideration of the synthetic data-generation methods employed. The intricacy of these methods and the requirement for domain-specific expertise to effectively apply them may constrain the accessibility and uptake of DA within GS.
We attribute the absence of improvement in Pearson's correlation to a phenomenon known as range restriction [24], which occurs when a metric such as Pearson's correlation is computed on a restricted sample rather than the entire dataset. Consequently, we advocate for further investigation into how to fully leverage data augmentation techniques within the context of genomic prediction.
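The following small simulation (purely illustrative; the numbers are synthetic and unrelated to the datasets analyzed here) makes the range-restriction effect concrete: when Pearson's correlation is computed only on the top fraction of lines, the reduced variance of the subsample markedly lowers the correlation, even though the underlying predictive relationship is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
y_true = rng.normal(size=n)                             # simulated phenotypes
y_pred = 0.7 * y_true + rng.normal(scale=0.7, size=n)   # predictions, r ~ 0.7

r_full = np.corrcoef(y_true, y_pred)[0, 1]

# Restrict to the top 20% of lines by observed phenotype
top = y_true >= np.quantile(y_true, 0.8)
r_top = np.corrcoef(y_true[top], y_pred[top])[0, 1]

print(f"Pearson correlation, full set : {r_full:.2f}")  # close to 0.7
print(f"Pearson correlation, top 20%  : {r_top:.2f}")   # substantially lower
```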
In this application, we used data augmentation to enhance both the response variable (as defined in Equation (3)) and the input features (the markers, as defined in Equation (2)). Specifically, we used the mixup method, as detailed in Section 2. It is worth noting that numerous data augmentation methods exist; however, not all of them are suitable for tabular data, which is the form GS data commonly take. In this study, we focused exclusively on the mixup method, leaving ample room for the future exploration of alternative techniques and of methods to fine-tune the data generated with the mixup approach. It is essential to emphasize the importance of a thoughtful and well-considered implementation of data augmentation. A growing body of empirical evidence suggests that data augmentation can significantly enhance model performance, mitigate data scarcity, and improve generalization, and it continues to evolve as a valuable tool in the toolkit of machine learning practitioners.
In general, our results provide empirical evidence that data augmentation techniques are promising tools for generating synthetic data, offering numerous advantages across a wide spectrum of applications. Some of these advantages are the following: (1) Enhanced data privacy and security. Synthetic data generation empowers organizations to construct realistic and representative datasets without compromising the confidentiality of sensitive or private information. (2) Scalability. The process of generating synthetic data is scalable and does not require the arduous collection and manual labeling of extensive real-world datasets. (3) Data diversity. Data augmentation techniques can generate diverse data samples that encompass various scenarios and edge cases, which may prove challenging to capture through real-world data collection. (4) Mitigating data imbalances. Synthetic data generation can effectively address imbalances in datasets by generating additional samples for minority classes, thereby enhancing the overall performance of machine learning models. (5) Accelerated research. In the realm of research and experimentation, synthetic data can speed up prototyping and hypothesis testing, enabling researchers to explore novel concepts and iterate rapidly. In conclusion, the application of data augmentation to generate synthetic data stands as a promising avenue with far-reaching benefits for data-driven endeavors [13,14,15,16].