1. Introduction
Plant breeding is essential for maintaining a stable food supply to meet increasing global food demand. To address this challenge, it is vital to adopt innovative methods that promote rapid genetic improvement and enhance agricultural productivity, particularly in the face of climate change. Traditional breeding methods, which rely on labor-intensive hybridization and selection processes, have limitations that have generated significant interest in genomic selection (GS) for crop breeding [1,2]. GS enhances genetic gains by reducing breeding cycles and optimizing resource use. Its successful application in livestock breeding has encouraged plant breeders to implement GS for predicting inbred performance, aiding parental selection, and forecasting hybrid performance [3,4]. For these reasons, GS is revolutionizing plant breeding programs: it offers significant advantages in accuracy and efficiency and enables more precise and effective breeding strategies. This transformative approach enhances yield, improves quality, and boosts resilience to environmental challenges [4,5]. Countries adopting GS are poised to enhance their food sovereignty by improving the productivity, sustainability, and resilience of their food production systems while autonomously managing and conserving genetic resources [4,5].
GS aims to merge comprehensive genotypic and phenotypic data from a training population to develop predictive models [5]. These models are used to estimate genetic values and select individuals within a breeding population based on their genotype data. This method circumvents extensive testing, thereby avoiding biases in marker effect estimates and speeding up the breeding process.
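As a concrete illustration, the marker-effect models underlying GS can be sketched as a ridge regression of phenotypes on genotypes, which is equivalent in spirit to RR-BLUP/GBLUP. The function name, genotype coding, and shrinkage value below are illustrative assumptions, not the models used in any cited study:

```python
import numpy as np

def rrblup_predict(X_train, y_train, X_new, ridge=1.0):
    """Estimate marker effects by ridge regression (RR-BLUP-style)
    and predict genetic values for new, untested genotypes.

    X_*: genotype matrices (individuals x markers, e.g. coded 0/1/2).
    ridge: illustrative shrinkage parameter (plays the role of the
    variance ratio sigma_e^2 / sigma_u^2 in BLUP).
    """
    p = X_train.shape[1]
    # Solve (X'X + ridge*I) beta = X'y for the marker effects.
    beta = np.linalg.solve(X_train.T @ X_train + ridge * np.eye(p),
                           X_train.T @ y_train)
    # Predicted genetic values are genotype-weighted sums of effects.
    return X_new @ beta

# Toy usage: 6 training lines, 4 markers, 2 selection candidates.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 4)).astype(float)
true_beta = np.array([0.5, -0.2, 0.1, 0.0])
y = X @ true_beta + rng.normal(0, 0.1, size=6)
preds = rrblup_predict(X, y, X[:2])
```

In practice, candidates are then ranked by their predicted genetic values and the best are advanced without phenotyping, which is the mechanism by which GS shortens breeding cycles.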
The similarity between training and breeding populations is crucial for accurate predictions [5]. Higher accuracies are achieved when the training population closely resembles the breeding population. In contrast, greater genetic distances between the two populations result in rapid decreases in accuracy [3,6]. The optimal size of a training population depends on relatedness, trait heritability, and population structure. Smaller training populations are ideal for closely related groups, while larger ones are necessary for more distantly related populations [7].
Accurate prediction is key to the successful implementation of GS, as it enables breeders to select individuals with desirable traits for future breeding cycles, thereby increasing genetic gain per cycle compared to marker-assisted selection [8]. Predictability, a measure of prediction accuracy, has been assessed in crops such as maize, wheat, and barley through cross-validation [9,10]. These studies have shown that predictability is influenced by heritability, relatedness, sample size, marker density, and genetic architecture. Generally, predictability increases with higher marker density and larger sample sizes until it plateaus. The relatedness between training and breeding populations also significantly affects predictability.
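The cross-validation scheme used to estimate predictability can be sketched as follows, with predictability computed as the Pearson correlation between observed and predicted values pooled over folds. The ridge predictor, fold count, and simulation settings are simplifying assumptions for illustration, not the designs of the cited studies:

```python
import numpy as np

def cv_predictability(X, y, k=5, ridge=1.0, seed=0):
    """k-fold cross-validation estimate of predictability: correlation
    between observed phenotypes and out-of-fold predictions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    y_hat = np.empty_like(y, dtype=float)
    p = X.shape[1]
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        # Ridge solution for marker effects on the training fold.
        beta = np.linalg.solve(
            X[train_idx].T @ X[train_idx] + ridge * np.eye(p),
            X[train_idx].T @ y[train_idx])
        y_hat[test_idx] = X[test_idx] @ beta
    return np.corrcoef(y, y_hat)[0, 1]

# Simulated example: 60 lines, 10 markers, additive trait plus noise.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(60, 10)).astype(float)
y = X @ rng.normal(0, 0.3, size=10) + rng.normal(0, 0.2, size=60)
r = cv_predictability(X, y)
```

Repeating this with larger samples or denser markers reproduces the qualitative pattern described above: predictability rises and then plateaus.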
Given the complexities of genetics, environmental variation, and data limitations, new methods are needed to improve prediction accuracy in GS. Accurate phenotyping and marker data are essential to minimize prediction errors. Optimizing GS methodology is challenging, as some factors require increased resources while others do not benefit from such increases; statistical machine learning models, however, offer a promising avenue for optimization [11]. Studies comparing GS models have shown that no single model is best for all traits: prediction accuracy depends on the number of genes controlling the trait, the distribution of allele effects, the presence of epistasis, and heritability [12]. Bayesian methods are popular in genomic prediction because they can incorporate prior knowledge, handle high-dimensional and correlated data, and provide a probabilistic interpretation of predictions. This flexibility allows for more accurate and robust predictions that accommodate the complex genetic architecture and uncertainty inherent in genomic data. Among the many Bayesian methods available, such as BayesA, BayesB, BayesC, and the Bayesian Lasso, Bayesian GBLUP is widely used for its robustness and computational flexibility. In recent years, however, deep learning models have also emerged in genomic prediction.
Deep learning models can sometimes achieve higher prediction accuracy by learning directly from raw data, such as text, sound, and images. These models use large, labeled datasets to automatically extract features, eliminating the manual feature engineering typical of traditional machine learning. However, large amounts of high-quality data are needed to prevent overfitting, especially when data are limited. Developing effective deep learning models requires minimizing error on the training set without compromising performance on the validation set. Data augmentation (DA) is a powerful technique for reducing training set errors and combating model overfitting [13,14]. DA artificially increases the size of the training dataset through techniques such as data warping or oversampling, thereby enhancing the generalizability and overall performance of trained models.
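A minimal sketch of DA by oversampling for tabular genomic data is shown below, assuming numeric marker codes; the noise level and function name are hypothetical choices for illustration, not techniques from the cited studies:

```python
import numpy as np

def jitter_oversample(X, y, n_new, noise_sd=0.05, seed=0):
    """Generic oversampling with Gaussian jitter: resample training rows
    with replacement and perturb their features slightly, keeping the
    original labels. noise_sd is an illustrative hyperparameter."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)
    X_new = X[idx] + rng.normal(0, noise_sd, size=(n_new, X.shape[1]))
    # Return the original data followed by the synthetic rows.
    return np.vstack([X, X_new]), np.concatenate([y, y[idx]])

# Usage: grow a 4-sample training set to 10 samples.
X = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([1.0, 2.0, 3.0, 4.0])
X_aug, y_aug = jitter_oversample(X, y, n_new=6)
```

The same interface, resampling rows and returning an enlarged training set, applies to the interpolation-based augmentation discussed later in this work.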
Data augmentation techniques such as flipping, rotating, and cropping have proven successful in image classification, speech recognition, and natural language processing. The average gain in prediction performance from DA depends on the specific dataset and augmentation techniques used. In genomic selection for plant breeding, where large-scale training and genomic data are often limited, DA can maximize the utility of existing data. For example, Enkvetchakul and Surinta [15] developed a plant disease recognition system using deep convolutional neural networks, achieving higher accuracy by combining offline training with data augmentation techniques. Chergui [16] evaluated five regression models on three datasets (primary, with additional features, and augmented) and found that cross-validation showed an overall performance increase with the augmented data.
Because insufficient prediction accuracy remains a key obstacle to implementing GS, data augmentation is a promising avenue for enhancing predictive performance. This research aims to leverage data augmentation algorithms to improve prediction accuracy, which is essential for the successful adoption of GS methodologies in plant breeding.
4. Discussion
Data augmentation techniques serve as foundational tools for improving prediction accuracy within genomic selection (GS) for plant breeding. The essence of data augmentation lies in its ability to artificially expand the training dataset by generating additional samples through various transformations or perturbations. These augmented data samples introduce diversity into the training process, thereby enriching the learning experience for predictive models. By addressing the inherent limitations associated with relatively small or constrained datasets in practical GS applications, data augmentation plays a crucial role in enhancing the robustness and effectiveness of predictive models. Its significance extends beyond plant breeding to various domains of machine learning, where the augmentation of training data has proven to be instrumental in improving model generalizability and performance.
4.1. Application of Mixup Method
Our study focuses on the application of the mixup method [18], which stands out for its effectiveness in stabilizing model predictions and improving generalization. Mixup blends pairs of training samples and their corresponding labels by linear interpolation, generating synthetic data points that lie along the line segment connecting the original samples. Applied across diverse maize and soybean datasets, the mixup method demonstrated promising results in enhancing prediction performance. Specifically, our analysis reveals significant improvements, particularly evident in the top-performing lines across all the datasets examined. This underscores the potential of mixup as a valuable augmentation technique within genomic prediction frameworks. It is important to note that we augmented only a portion of the training data, since augmenting the whole training set produced worse prediction performance.
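A minimal sketch of mixup for genomic regression data, following the linear-interpolation rule described above: each synthetic sample is a convex combination of a random pair of training samples and of their labels, with the mixing weight drawn from a Beta distribution. The Beta parameter and sample counts are illustrative, not the settings tuned in our study:

```python
import numpy as np

def mixup(X, y, n_new, alpha=0.2, seed=0):
    """Generate n_new synthetic (genotype, phenotype) pairs by mixup.

    For each new sample, draw lambda ~ Beta(alpha, alpha) and return
    lambda * (x_i, y_i) + (1 - lambda) * (x_j, y_j) for a random pair
    (i, j) of training rows."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    lam = rng.beta(alpha, alpha, size=n_new)
    X_new = lam[:, None] * X[i] + (1 - lam[:, None]) * X[j]
    y_new = lam * y[i] + (1 - lam) * y[j]
    return X_new, y_new

# Usage: augment a small training set of 20 lines with 40 mixup samples.
rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(20, 8)).astype(float)
y = rng.normal(size=20)
X_aug, y_aug = mixup(X, y, n_new=40)
X_full = np.vstack([X, X_aug])        # augmented training genotypes
y_full = np.concatenate([y, y_aug])   # augmented training phenotypes
```

Because the synthetic labels are convex combinations of real labels, mixup cannot produce phenotypes outside the observed range, which contributes to the stabilizing effect noted above.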
4.2. Emphasis on Top Lines and Restricted Augmentation
A distinctive aspect of our approach involves the deliberate emphasis on augmenting the top-performing lines during the training phase. This strategic decision stems from the recognition of the disproportionate impact that these lines often have on overall model performance, particularly in plant breeding programs where the focus is often on elite or high-yielding genotypes [26]. By targeting augmentation efforts towards these top lines, we aim to mitigate prediction errors specific to this subset, thereby potentially improving the overall predictive performance of the model. While our findings demonstrate a notable reduction in errors within this targeted subset, it is essential to acknowledge the inherent limitations of this approach. Despite its effectiveness in optimizing performance for select lines, its impact on overall model performance across the entire testing set may be somewhat constrained [22]. Therefore, further refinement and optimization are warranted to address this discrepancy and maximize the utility of data augmentation strategies within genomic prediction frameworks.
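The restricted-augmentation idea can be sketched as mixup applied only within the top-performing fraction of training lines; the fraction, sample counts, and helper names below are illustrative assumptions rather than the exact configuration used in our experiments:

```python
import numpy as np

def augment_top_lines(X, y, top_frac=0.2, n_new=30, alpha=0.2, seed=0):
    """Apply mixup only among the top-performing training lines.

    top_frac: illustrative fraction of lines treated as 'top'.
    Returns the original data followed by n_new synthetic samples
    interpolated between random pairs of top lines."""
    rng = np.random.default_rng(seed)
    n_top = max(2, int(np.ceil(top_frac * len(y))))
    top = np.argsort(y)[-n_top:]          # indices of the best lines
    i = rng.choice(top, size=n_new)
    j = rng.choice(top, size=n_new)
    lam = rng.beta(alpha, alpha, size=n_new)
    X_new = lam[:, None] * X[i] + (1 - lam[:, None]) * X[j]
    y_new = lam * y[i] + (1 - lam) * y[j]
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Usage: 25 training lines, top 20% augmented with 30 synthetic samples.
rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(25, 6)).astype(float)
y = rng.normal(size=25)
X_aug, y_aug = augment_top_lines(X, y)
```

Because every synthetic label is a convex combination of top-line labels, the added samples densify exactly the high-performing region of the phenotype distribution that the selection decision depends on.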
4.3. Importance of Variability in Data Augmentation Strategies
The introduction of variability through data augmentation is a fundamental aspect that underpins its efficacy in genomic prediction. Genetic variation is a hallmark of biological systems, and capturing this variability is essential for developing predictive models that can effectively generalize across diverse genetic backgrounds. Synthetic data generated through augmentation techniques facilitate the representation of a broader spectrum of genetic variations, thereby enhancing the model’s ability to adapt to novel or unseen genotypes. By mitigating the risk of overfitting and expanding the representation of genetic diversity within the training dataset, variability emerges as a pivotal element in augmenting predictive capabilities within genomic selection frameworks. Therefore, careful consideration and integration of variability into the design and implementation of data augmentation strategies are essential to ensure optimal performance and generalizability of predictive models.
4.4. Cautionary Note on Data Augmentation Strategies
While data augmentation techniques offer significant benefits in enhancing prediction accuracy, it is crucial to exercise caution in their implementation. The effectiveness of data augmentation is contingent upon various factors, including the selection of appropriate augmentation techniques, the tuning of hyperparameters, and the characteristics of the dataset itself. Improper selection or application of augmentation techniques can lead to unintended consequences, such as model overfitting or degraded predictive performance. Therefore, a thorough understanding of the underlying principles of data augmentation, coupled with careful experimentation and validation, is necessary to ensure the robustness and reliability of predictive models in real-world applications. Ongoing research is also needed to further refine and optimize data augmentation methodologies, maximizing their potential benefits while mitigating potential risks and challenges. Finally, DA may also be of interest for genomic prediction in classification settings, but that was not explored here.