1. Introduction
The world confronts several challenges in satisfying the increasing demand to feed the growing human population, which is projected to reach close to 10 billion by 2050 [1,2]. However, not only is the population increasing, but natural resources (e.g., forests, soil, land, and water availability) have also been drastically affected by environmental problems such as deforestation and land degradation [1]. In addition, in agriculture, elite genotypes (those with high yield performance) have been negatively impacted by increasingly frequent and intense environmental perturbations. To guarantee food requirements and confront these challenges, the new varieties yet to be developed (in the very near future) will need to offer better resilience for a wide range of adaptation [3]. For this, new strategies and methodologies should be developed for selecting genotypes with high yield potential that can face these environmental challenges [4].
Breeders have implemented traditional breeding methods for selecting the best phenotypes to increase genetic gains [5,6]. However, phenotyping all genotypes under a wide range of environmental conditions is challenging: it requires a large number of plots in field experiments, it demands time-consuming manual labor, and, in general, resources such as land, water, and seed are limited. A more elaborate traditional breeding method considers the use of pedigree information derived from the genetic relationships between the genotypes in the population [7,8,9]. Pedigree-based selection has been successful in delivering predictions of the estimated breeding values of unobserved genotypes; however, its implementation presents some challenges [10]. For example, it requires keeping track of the genetic relationships between all genotypes in the training and testing sets. Also, it does not account for the Mendelian segregation within populations derived from a pair of genotypes, limiting the rates of genetic progress that can be accomplished in a given period of time [11].
Hence, the traditional selection methods based on phenotypic and pedigree information may not be the most suitable options for increasing genetic gains over short periods of time, especially because it is not easy to estimate a priori the amount of recombination of the genome that comes from each of the parental lines [6], which complicates the selection process. The development of modern sequencing technologies offered the opportunity to characterize genotypes based on their genomic information [11]. A widely used alternative to the traditional selection methods is Marker Assisted Selection (MAS), which uses genomic information [12]. It considers a reduced set of influential molecular markers, also known as quantitative trait loci (QTL) [13,14], to assist during the selection process. The main objective of MAS is to help select the best candidate genotypes by predicting their phenotypic performance using the most influential genomic variants. Furthermore, this methodology has been shown to be more effective than pedigree-based selection [12,13,14]. However, it also presents some limitations, especially when traits are controlled by a large number of genes with small effects (complex traits), such as yield [15], reducing the accuracy of the selection.
To overcome the limitations of MAS, an emergent methodology called genomic selection (GS) became popular in the last decade in plant and animal breeding applications across species and traits. Conceptually, this method uses the information of all available molecular markers for selection purposes [11]. It was first proposed by Bernardo [16], and later Meuwissen [17] introduced a new framework to confront the challenge of dealing with a large set of markers (p) and a reduced number of phenotypic records (n) available for model fitting.
GS enables the prediction of the performance of genotypes at early stages of breeding programs using abundant molecular marker information on new/untested genotypes and a relatively small number of genotypes with both phenotypic and genomic information for model calibration. The predicted values can then be used for selecting the best candidate genotypes, those expected to perform well at advanced phenotyping stages [18]. Another advantage of GS is that breeders can reduce phenotyping costs by employing predictions as surrogates of phenotypes. Initially, GS was used in plant breeding for performing within-environment predictions only [18,19,20,21]. In general, breeders establish extensive field experiments for testing new cultivars under a wide range of environmental conditions and release stable genotypes that outperform current elite cultivars [4]. However, different response patterns are usually observed when the same genotypes are evaluated in different environments, showing a change in the relative ranking from one environment to another and complicating the selection process [22]. This change in response patterns is known as the genotype-by-environment interaction (G×E).
Several studies have highlighted the impact of accounting for G×E in GS models [23,24,25] when performing predictions in multiple environments. Several applications have been developed to conduct predictions of genotypes in single and multiple environments [26,27,28,29,30,31,32]. However, to our knowledge, no comprehensive implementations/examples showing how genomic prediction pipelines are built have been released. Among the tasks that a GS pipeline comprises are the implementation of quality control on genomic data, the assignation of training and testing sets for different cross-validation schemes, the construction of the different linear predictors, and the model fitting.
The main objective of this study is to provide an example of how a genomic prediction pipeline is built under different cross-validation schemes while preserving comparable sample sizes in training and testing sets. For this, we considered two soybean data sets. The first one (D1) corresponds to a sample of the USDA soybean collection with information on 324 genotypes tested in 6 environments (not all genotypes tested in all environments) and 9 traits. The second (D2) corresponds to a sample of the SoyNAM experiment with information on 324 genotypes tested in 6 environments and 6 traits (all genotypes scored for all traits in all environments). The pipeline was built from elemental modules that perform simple tasks and whose implementation is controlled by changes in a parameter input file. Also, the outputs of the early stages of the pipeline become the inputs of more advanced stages, allowing complex structures to be assembled easily. Users will thus be able to modify and adapt the proposed pipeline to conduct their own data analyses (their own data sets with a desired set of parameters).
2. Materials and Methods
2.1. Phenotypic and Genomic Data
In this research, two different soybean datasets were used to show how the pipeline is implemented; these correspond to a sample of the USDA soybean collection and a sample of the SoyNAM experiment.
Data set 1 (D1). Sample of the USDA soybean collection.
The USDA soybean collection comprises 14,430 genotypes that were collected in many locations around the world and observed in 4 different locations (states: Illinois, Kentucky, Minnesota, and Missouri) in the USA from 1963 to 2003. Not all genotypes were observed in all location-by-year combinations (environments). Further details of the USDA soybean collection can be found in Bandillo [33]. The evaluation of the soybean genotypes in the US locations was conducted gradually; for this reason, the connectivity rate of genotypes across environments is very low. In this study, we conveniently selected a reduced set of 324 genotypes that were observed in 6 environments (MN945, IL945, IL0102, MS989, MS2000_2, and MN0102) and showed moderate levels of connectivity. We selected genotypes that were observed in at least 2 environments and had complete information on all 9 traits (grain yield, plant height in centimeters, lodging 1–5, days to physiological maturity - DysToR8, oil content, protein content, seed weight of 100 seeds, early shattering 1–5, and stem termination score 1–5).
Figure 1 illustrates the levels of connectivity of the genotypes across environments for this data set (D1). Out of the total (324 × 6 = 1944) potential cells (all genotypes observed in all environments), only 33.6% (654) of these combinations were observed in the field (vertical gray lines in Figure 1).
Data set 2 (D2). Sample of the SoyNAM project.
A random sample of the SoyNAM project was selected for the second dataset. The SoyNAM project comprises 40 biparental populations (140 individuals per population) sharing a common hub parent (IA3023) crossed with elite parents (17), plant introductions (8), and parents with exotic ancestry (15). References [34,35] provide a detailed description of the SoyNAM data set. Briefly, the resulting 5400 accessions derived from the 40 biparental populations were observed in 18 location × year combinations (environments). In our case, we conveniently selected a random sample of 324 genotypes observed in 6 environments (IA_2012, IL_2011, IL_2012, IN_2012, NE_2011, and NE_2012) and measured for 6 traits (yield, moisture, protein content, oil content, fiber, and seed size). In this case, all 324 genotypes were observed in all 6 environments and were scored for all 6 traits.
2.2. Models
The main objective of this research is to provide a genomic prediction pipeline that is easy to adapt to different realistic scenarios of interest in breeding programs. Here, a set of 3 models was considered to show how to compose the linear predictors used under the four different cross-validation schemes. Alternative models can be obtained by changing the assumptions on the model terms; below, we describe how to perform these changes in an easy manner.
2.2.1. M1. E + L
Consider that $y_{ij}$ represents the performance of the $i$th genotype observed in the $j$th environment for a given trait (e.g., grain yield). It can be described as the sum of a constant effect common across lines and environments $\mu$, a fixed effect $E_j$ due to the environmental stimuli of the $j$th environment, a random effect $L_i$ corresponding to the $i$th line such that $L_i \overset{iid}{\sim} N(0, \sigma^2_L)$, and a random effect $\varepsilon_{ij}$ capturing the variability not explained by the previous model terms, with $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2_\varepsilon)$. Collecting the previous assumptions, the linear predictor becomes

$y_{ij} = \mu + E_j + L_i + \varepsilon_{ij}.$
One disadvantage of this model is that it does not allow the borrowing of information between genotypes complicating the prediction of the untested materials. To overcome this issue, the genomic information of the individuals in training and testing sets can be leveraged together with the phenotypic data from the observed genotypes (training set) to predict un-phenotyped individuals. Details of this approach are provided in model M2.
2.2.2. M2. E + L + G
In the previous model M1, the $L_i$ term is used to describe the effect of the $i$th genotype, and it relies on phenotypic information only. Now, consider that this term can also be described by a linear combination of $p$ molecular markers and their corresponding marker effects, such that $g_i = \sum_{k=1}^{p} x_{ik} b_k$, where $b_k$ corresponds to the marker effect of the $k$th SNP ($k = 1, \ldots, p$). When the number of molecular markers ($p$) surpasses the number of data points ($n$) available for model fitting, it is impossible to obtain a unique solution for the marker effects because it involves the inversion of non-full-rank matrices. In these cases, further assumptions about the marker effects should be considered under the statistical framework. Several alternatives have been proposed to overcome this issue, some based on penalized regressions (Ridge Regression, LASSO, Elastic Net, etc.) and others on Bayesian approaches (Bayesian Ridge Regression, Bayesian LASSO, BayesA, BayesB, etc.). Meuwissen [17] proposed a set of models for those cases where the number of genomic variants ($p$) is larger than the number of data points ($n$) available for model fitting. A comprehensive review of the available genomic models dealing with this issue can be found in [11].
In our case, the marker effects were considered independent and identically distributed (IID) outcomes from a normal distribution centered at zero with a common variance $\sigma^2_b$ [36,37], such that $b_k \overset{iid}{\sim} N(0, \sigma^2_b)$. From results on the multivariate normal distribution, the vector of genomic effects $\boldsymbol{g} \sim N(\boldsymbol{0}, \boldsymbol{G}\sigma^2_g)$, where $\boldsymbol{G} = \boldsymbol{X}\boldsymbol{X}^{\top}/p$, $\boldsymbol{X}$ is the standardized (by columns) matrix of marker SNPs, and $\sigma^2_g$ is the corresponding variance component. To avoid model misspecification due to imperfect genomic data, the $L_i$ term is also included in the model together with the genomic effect $g_i$. Considering the previous assumptions, the resulting model becomes

$y_{ij} = \mu + E_j + L_i + g_i + \varepsilon_{ij}.$
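As an illustration, the genomic relationship matrix defined above can be obtained with a few lines of R. This is a minimal sketch, not part of the pipeline itself; it assumes a numeric genotypes × markers matrix X (e.g., 0/1/2 coding) with monomorphic markers removed and missing calls imputed.

Xs <- scale(X, center = TRUE, scale = TRUE)  # standardize markers by columns
p  <- ncol(Xs)
G  <- tcrossprod(Xs) / p                     # G = X X' / p, as defined above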
An advantage of this model is that it allows the borrowing of information between tested and untested genotypes, permitting the prediction of materials yet to be observed. However, a disadvantage of this model is that it returns the same genomic value $g_i$ for the $i$th genotype across environments. To allow environment-specific genomic values of genotypes, the reaction norm model [25] was also implemented. This model decomposes the genomic effect as the sum of a common effect (intercept) of the genotypes across environments plus specific effects (slopes) for each environment. Further details of this model are provided next.
2.2.3. M3. E + L + G + G×E
Consider the inclusion of the $ge_{ij}$ model term to describe the specific response of the $i$th genotype in the $j$th environment. Jarquin [25] proposed to model the vector of genomic effects in interaction with environments via covariance structures as $\boldsymbol{ge} \sim N(\boldsymbol{0}, (\boldsymbol{Z}_g \boldsymbol{G} \boldsymbol{Z}_g^{\top}) \circ (\boldsymbol{Z}_E \boldsymbol{Z}_E^{\top}) \, \sigma^2_{ge})$, where $\boldsymbol{Z}_g$ and $\boldsymbol{Z}_E$ are the corresponding incidence matrices that connect phenotypes with genotypes and environments, respectively, $\circ$ represents the cell-by-cell product between two matrices, also known as the Hadamard or Schur product, and $\sigma^2_{ge}$ is the corresponding variance component. The resulting linear predictor is

$y_{ij} = \mu + E_j + L_i + g_i + ge_{ij} + \varepsilon_{ij}.$
2.3. Cross-Validation Schemes
To assess the ability of the different prediction models to deliver accurate results, four prediction scenarios of interest to breeders were considered. These scenarios attempt to mimic realistic prediction problems that breeders might face [38] at different stages of the breeding pipeline. Figure 2 presents the four cross-validation scenarios using as an example a hypothetical population of 48 genotypes to be observed in 6 environments. The different colors (vertical lines) correspond to a fivefold assignation of either phenotypes (CV2 and CV0) or genotypes (CV1 and CV00).
CV2 (tested genotypes in observed environments) corresponds to the scenario of predicting incomplete field trials, where some genotypes have been observed in some environments but not in others. In this case, the genotypes of interest have probably been observed in other environments, and other genotypes have already been observed in the environment(s) of interest. A random fivefold assignation, represented with different colors (black, gray, red, yellow, and blue) in the top left panel of Figure 2, was considered. Here the phenotypic data were randomly assigned to each of the five folds (colors), maintaining folds of similar size (~20% of the observations). Then four folds are employed for model calibration when predicting the remaining fold, and this procedure is sequentially repeated for all five folds (one at a time).
CV1 (untested genotypes in observed environments) mimics the scenario of predicting genotypes that have not yet been observed in any environment, where the goal is to predict the performance of these genotypes in environments where other genotypes have already been observed. Here, a fivefold cross-validation was implemented by assigning around 20% of the genotypes to each fold (bottom left panel in Figure 2), such that all the phenotypic records of a genotype are assigned to the same fold (color), so that phenotypes of the same genotype never appear in different folds. In the bottom left panel of Figure 2, across environments (horizontal lines) the phenotypes of the 48 genotypes have the same color, and genotypes with the same color belong to the same fold. As before, four folds are considered for model training when predicting the remaining fold, and this prediction procedure is sequentially repeated for each of the five folds (one at a time).
CV0 (tested genotypes in unobserved environments) represents the scenario of predicting the mean performance of genotypes in hypothetical unobserved environments. It considers phenotypic information of the same and other genotypes observed in other environments (training set). In this case, the conventional prediction procedure consists of leaving one environment out and using the remaining environments for model calibration when predicting the excluded environment; this procedure is sequentially repeated for each environment (one at a time). However, in our case, we introduce an alternative way to conduct the prediction of unobserved environments in an attempt to preserve sample sizes for training and testing sets similar to those of the previous schemes. In this way, it is possible to compare the results of the different cross-validation scenarios under similar sample sizes. The top right panel of Figure 2 illustrates an example that considers the prediction of the genotypes in gray (horizontal lines) in environment 3. In this case, information is available for the same genotypes but observed in the remaining 5 environments. Here, the same fold assignation as in CV2 was used, making a direct comparison of the predictive ability achieved under these two cross-validation scenarios possible. The prediction procedure consists of sequentially predicting each of the five folds in each environment (one at a time), and it is repeated for each environment (one at a time).
CV00 (untested genotypes in unobserved environments) corresponds to the case of predicting new genotypes in novel environments. The conventional method for this scenario consists of discarding from the training set the phenotypic records of the genotypes in the target environment (testing set), and then predicting the performance of those genotypes in the novel environment using the phenotypic information available in the calibration environments. However, this procedure poses extra challenges as the number of common genotypes across environments increases. In our hypothetical example in Figure 2, all the genotypes were observed in all environments; thus, if the phenotypic information of the target genotypes were discarded across all environments, no phenotypic records would remain for model calibration. With the proposed scheme (bottom right panel in Figure 2), only the phenotypic information of the genotypes in the gray fold is deleted across environments. Then the information of the remaining folds (colors) in the other environments is used for model calibration when predicting the gray fold in environment 3. The same procedure is repeated for the remaining four folds (one at a time) in environment 3, and then for the other environments (one at a time).
2.4. Assessment of Predictive Ability within and across Environments
For all cross-validation schemes, the predictive ability was assessed as the within-environment correlation between predicted and observed values. The average correlation across environments was computed according to [39], accounting for the uncertainty and the sample size of the environments, as

$\bar{r} = \frac{\sum_{i} r_i / v_i}{\sum_{i} 1 / v_i},$

where $r_i$ is the Pearson's correlation between predicted and observed values in the $i$th environment, $v_i = (1 - r_i^2)^2 / (n_i - 1)$ corresponds to its sampling variance, and $n_i$ is the number of observations in the $i$th environment.
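A small R helper consistent with the weighted average above could look as follows (a sketch; avg_cor is a hypothetical function name, and r and n are the vectors of within-environment correlations and sample sizes):

avg_cor <- function(r, n) {
  v <- (1 - r^2)^2 / (n - 1)  # approximate sampling variance of each r_i
  w <- 1 / v                  # inverse-variance weights
  sum(w * r) / sum(w)
}
avg_cor(r = c(0.42, 0.55, 0.31), n = c(110, 96, 120))  # illustrative values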
2.5. Variance Components
To assess the relative importance of the different model terms in each of the prediction models, a full data analysis (i.e., with no missing values) is conducted to compute the corresponding variance components. Then, the percentage of variability explained by the $z$th model term is computed as the ratio between the corresponding variance component and the sum of all the variance components of the model, times 100: $\%_z = 100 \times \hat{\sigma}^2_z / \sum_{k} \hat{\sigma}^2_k$. The percentage of explained variability was computed for each model and each trait.
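For example, given a named vector of estimated variance components from the full-data fit, the percentages follow directly (a sketch with hypothetical values):

varcomp <- c(E = 0.9, L = 0.8, G = 1.5, GxE = 0.6, Error = 2.1)  # hypothetical estimates
pct <- 100 * varcomp / sum(varcomp)  # % of variability explained by each term
round(pct, 1)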
2.6. Modules
One of the principal objectives of this manuscript is to provide a template for implementing genomic prediction pipelines in an easy way. First, the different modules used for assembling the pipeline are presented. These modules work in such a way that their outputs become the inputs of other modules at more advanced stages of the pipeline. The pipeline presented here covers all stages, from the initial data sets to the prediction stage. Two examples of these pipelines are provided in the supplementary section.
The structure of the different modules is as follows: each comprises three directories or folders, “code”, “input”, and “output”. The “code” folder contains the “mainCode.R” script, which performs a specific routine depending on the module. The “input” folder contains a parameter file, “parameters.R”, where the inputs of the module are specified. During the implementation of the different modules, this is the only file that is modified according to the different conditions (parameters) considered in the routines. The “output” folder is used for storing the results, such that the outputs derived from these modules (routines) at earlier stages can be used as the corresponding inputs at more advanced stages of the pipeline.
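Schematically, every module therefore has the following layout (file names as introduced above):

code/
  mainCode.R    # performs the module-specific routine; never edited
input/
  parameters.R  # the only file modified between runs
output/         # results, consumed as inputs by later stages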
Initially, the phenotypic and genomic data are stored in a common repository/directory called “1.root.Data”; the modules then refer to these data sets at the different stages of the pipeline. It is assumed that the genomic and phenotypic information (including missing values) is available for the same genotypes in both datasets. In our case, “Pheno.csv” and “SNPs.rda” correspond to the data sets containing the phenotypic and the molecular marker data. All of the modules used in our pipeline can be found in the “modules” folder in the supplementary section. For assembling the pipeline at the different stages, these modules should be copied and renamed, and only the “parameters.R” file should be modified.
2.7. SplittingSNPs: Module for Applying Quality Control on Marker Data
The “SplitingSNPs” module performs quality control based on the missingness of the marker data by discarding molecular markers that exceed a given proportion of missing values (PMV). In the “parameters.R” file, the path (mm.file) to the marker data set “SNPs.rda” and the highest proportion of missing values (NaN.freq.i) tolerated for including a marker SNP in the analysis should be specified. After applying the QC, the resulting set of marker SNPs (“X.csv”) is stored in the “output” folder.
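The core of this filtering step can be expressed in a few lines of R (a minimal sketch of the logic; the object name SNPs inside “SNPs.rda” and the threshold value are assumptions):

load("SNPs.rda")                               # assumed to load a genotypes x markers matrix named SNPs
NaN.freq.i <- 0.05                             # tolerated proportion of missing values (PMV)
pmv <- colMeans(is.na(SNPs))                   # proportion of missing values per marker
X <- SNPs[, pmv <= NaN.freq.i, drop = FALSE]   # discard markers exceeding the threshold
write.csv(X, file = "output/X.csv")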
2.8. Gmatrix: Module for Constructing the Covariance Matrices Using Genomic and Environmental Factors
The “Gmatrix” module is used for constructing the relationship matrices between pairs of genotypes and between pairs of environments using the matrix of marker SNPs or the name (or ID) of the environments, respectively. In the “parameters.R” file, the path to the matrix of phenotypes (phenotype.file) and the path to the matrix of molecular markers (mm.file) should be specified for computing the kinship matrix using genomic data. If the value of “mm.file” is declared as “NULL”, the covariance structure between genotypes or between environments is computed using the incidence matrices only. The value of the smallest allele frequency tolerated for including markers in the analysis is specified with the “prop.MAF.j” option. The “colIDy” parameter indicates the column in the matrix of phenotypes used to link genotypes with marker data or phenotypes with environments. The outputs of this module are stored in the “output” folder; these are the resulting kinship matrix (G.rda) and its corresponding eigenvalue decomposition (EVD.rda), which are needed to compute more elaborate model terms and to set up the linear predictor, respectively.
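Under these conventions, the essential computations of this module reduce to a minor-allele-frequency filter followed by the construction of G and its decomposition (a sketch, assuming 0/1/2 marker coding with imputed missing calls):

prop.MAF.j <- 0.05                        # smallest allele frequency tolerated
f <- colMeans(X) / 2                      # allele frequency per marker (0/1/2 coding)
keep <- pmin(f, 1 - f) >= prop.MAF.j      # minor allele frequency filter
Xs <- scale(X[, keep])                    # column-standardized markers
G <- tcrossprod(Xs) / sum(keep)           # kinship matrix G = X X' / p
EVD <- eigen(G)                           # eigenvalue decomposition for the model-fitting stage
save(G, file = "output/G.rda"); save(EVD, file = "output/EVD.rda")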
2.9. Z: Module for Constructing Incidence Matrices for Genotypes and Environments
The “Z” module is used to include the main effects of genotypes or environments. In the “parameters.R” file in the “input” folder, the path to the matrix containing the phenotypic information (phenotype.file) and the column holding the IDs of the genotypes or environments (colIDy) should be specified. In the “output” folder, the resulting incidence matrix “Z.rda” is stored, as well as a graphical representation of the distribution of phenotypes across genotypes or environments (exp.des.pdf).
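In R, such an incidence matrix can be built directly from the phenotype file (a sketch; the column index and file paths are illustrative):

Pheno <- read.csv("Pheno.csv")
colIDy <- 1                                        # column holding the genotype (or environment) IDs
Z <- model.matrix(~ factor(Pheno[, colIDy]) - 1)   # one indicator column per level
save(Z, file = "output/Z.rda")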
2.10. Imatrix: Module for Constructing the Interaction Matrix between Markers and Environments
The “Imatrix” module is used to compute the Hadamard product (or cell-by-cell product) between two covariance structures. The resulting matrix is needed for including the interaction between molecular markers and environments [25]. In the “parameters.R” file in the “input” folder, the paths to the resulting matrices of the two factors (G1.file and G2.file) considered in the interaction (G and E in our case) should be specified. The “output” folder will contain the resulting covariance matrix (G.rda) and its corresponding eigenvalue decomposition (EVD.rda).
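Following the reaction norm construction of Section 2.2.3, this computation amounts to expanding both covariance structures to the phenotype level and multiplying them cell by cell (a sketch; ZL and ZE are the incidence matrices from the “Z” module and G the kinship matrix from “Gmatrix”):

K.G <- ZL %*% G %*% t(ZL)   # genomic covariance expanded to the phenotype level
K.E <- ZE %*% t(ZE)         # environment covariance at the phenotype level
K.GE <- K.G * K.E           # Hadamard (cell-by-cell) product
EVD <- eigen(K.GE)
save(K.GE, file = "output/G.rda"); save(EVD, file = "output/EVD.rda")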
2.11. Preparing.CV1.CV2: Module for Assigning Genotypes and Phenotypes to Folds
This module is used for assigning phenotypes/genotypes to training-testing sets for the CV1 and CV2 cross-validation schemes. In the “parameters.R” file in the “input” folder, the path to the matrix of phenotypes (phenotype.file), the number of folds to consider in the cross-validation (folds), the column in the matrix of phenotypes containing the names (IDs) of the genotypes (colIDy), and the type of cross-validation (CV1, CV2, or both) should be specified. Also, it is possible to fix the seed value used in the randomization; it varies between replicates of the training-testing assignation. The resulting matrix (Y.csv) is stored in the “output” folder. This matrix is identical to the initial matrix of phenotypes except that the column(s) containing the fold assignation are appended at the end.
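The two assignation rules can be sketched as follows (assuming Y has one row per genotype × environment record and a genotype ID column named “GID”; both names are illustrative):

set.seed(1)      # the seed varies between replicates of the assignation
folds <- 5
# CV2: phenotypic records are assigned to folds completely at random
Y$CV2 <- sample(rep(1:folds, length.out = nrow(Y)))
# CV1: whole genotypes are assigned to folds, so all records of a genotype share a fold
gid <- unique(Y$GID)
gfold <- sample(rep(1:folds, length.out = length(gid)))
Y$CV1 <- gfold[match(Y$GID, gid)]
write.csv(Y, "output/Y.csv", row.names = FALSE)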
2.12. Preparing.CV0 and CV00 Module
Since one of the goals of this research is to provide cross-validation schemes that preserve comparable sample sizes, the results of the CV1 and CV2 assignations are used as inputs for the CV0 and CV00 assignations. For the CV0 scheme, the matrix resulting from the previous module under the CV2 assignation acts as the input. In the “parameters.R” file in the “input” folder, the path (phenotype.file) to the output file (Y.csv) derived from the “Preparing.CV1.CV2” module is specified, along with the number of folds (which should match the previous output), the column containing the phenotypic information (colPhen), and the column containing the fold information (colFolds) for the CV2 scheme. The “output” folder will contain the resulting matrix of phenotypes. In this case, depending on the number of folds, the same number of extra columns are added, masking as missing values the phenotypes belonging to the different folds in the different columns (one column per fold). For the CV00 assignation, a similar procedure is performed, but the column with the fold assignation of the CV1 scheme is used instead. The resulting matrix of phenotypes is likewise stored in the “output” folder.
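The masking logic behind both schemes can be sketched for a single fold k (illustrative column names; y is the phenotype, Env the environment, and CV1/CV2 the fold columns produced by the previous module):

k <- 1; target.env <- "IL_2012"   # illustrative fold and target environment
# CV0: mask fold k of the CV2 assignation only in the target environment;
# records of the same genotypes in the other environments stay in training
Y$yNA.CV0 <- Y$y
Y$yNA.CV0[Y$CV2 == k & Y$Env == target.env] <- NA
# CV00: mask fold k of the CV1 assignation in all environments, so the
# predicted genotypes carry no phenotypic records anywhere
Y$yNA.CV00 <- Y$y
Y$yNA.CV00[Y$CV1 == k] <- NA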
2.13. Fitting.Models: Module for Performing the Predictions of the Missing Values and Compute the Variance Components
This module is used for fitting the models, performing the predictions of the missing values, and computing the variance components. In the “parameters.R” file in the “input” folder, the path to the matrix of phenotypes (phenotype.file) and the IDs of the partitions to be predicted (e.g., the IDs of the folds [1, 2, 3, 4, and 5] or the IDs of the environments [CV0 and CV00]) are specified. Also, since the BGLR [26,31] R package [40] was used for model fitting and it is based on the Bayesian framework, it is also necessary to specify the number of iterations (nIter) for the Gibbs sampler and the number of iterations to be discarded as burn-in (burnIn). Then, the linear predictor is built by providing the different model terms and their corresponding assumptions.
For this, a list is created, “AB <- list()”, to hold the paths to the different model matrices created at previous stages, such that the ith element of the list corresponds to the ith model term, “AB[[i]] <-”. It is also necessary to specify the type of each effect: “FIXED” for a fixed effect, “BRR” for a random main effect (RR-BLUP), or “RKHS” for the GBLUP model. In addition, it is necessary to provide the column numbers in the matrix of phenotypes that contain the IDs of the genotypes (colVAR), the phenotypic information (colPhen), and the different training-testing partitions (colCV, folds or environments), and to fix the seed, “set.seed(i)”, to replicate exactly the same results with the Gibbs sampler.
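Putting the pieces together, the linear predictor of model M3 and the BGLR call can be sketched as follows (a sketch under the conventions above; yNA denotes the phenotype vector with the testing records set to NA, and EVD.G and EVD.GE are the eigenvalue decompositions produced by the “Gmatrix” and “Imatrix” modules):

library(BGLR)
AB <- list(
  list(X = ZE, model = "FIXED"),                                # environment main effects
  list(X = ZL, model = "BRR"),                                  # random line effects
  list(V = EVD.G$vectors,  d = EVD.G$values,  model = "RKHS"),  # genomic main effect g
  list(V = EVD.GE$vectors, d = EVD.GE$values, model = "RKHS")   # G×E interaction term
)
set.seed(1)                                                     # replicate the Gibbs sampler exactly
fm <- BGLR(y = yNA, ETA = AB, nIter = 12000, burnIn = 2000, verbose = FALSE)
yHat <- fm$yHat                                                 # predictions for the masked records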
2.14. Pipeline
Each of the different stages (2–6) of the pipeline is built using the modules stored in the repository folder (modules). For this, it is necessary to copy and rename them according to the different stages and to modify only the “parameters.R” file in the “input” folder. The pipeline starts with the “1.root.Data” folder, where the files with phenotypic (Phenos) and genomic information (SNPs) are stored. The next stage, corresponding to the “2.splitingSNPs” folder, considers the implementation of quality control (QC) on the genomic data based on missing values. Here, the path to the matrix of marker SNPs and the PMV should be provided; the resulting matrix “X.csv” is stored in the “output” folder.
The next stages consider the computation of the different model terms. The main and interaction effects are stored in folders “3.Gmatrices” and “4.Imatrices”, respectively. In folder “3.Gmatrices”, the covariance matrices for “G” and “E” are computed using the marker data and the IDs of the environments. In addition, the incidence matrices that connect phenotypes with genotypes (“ZL”) and with environments (“ZE”) are also obtained. In the “4.Imatrices” folder, the outputs of the covariance matrices “G” and “E” are used to compute the interaction matrix “G×E”.
The 5th stage corresponds to the training-testing assignation, and it is divided into three sections. In the 5.1.CV2.CV1 section, for each replicate (1–10) the folds (5) are assigned for the cross-validation schemes CV1 (predicting new genotypes in observed environments) and CV2 (incomplete field trials). The resulting matrices are stored in the corresponding “output” folder. The configurations for CV0 and CV00 are very similar. Under CV0, in the folder “5.2.CV0”, for each trait × replicate combination the training-testing assignation at each environment is performed by masking as missing values the corresponding observations derived from the CV2 scheme; for CV00, the corresponding observations derived from the CV1 scheme are used instead.
The 6th stage corresponds to the prediction of missing values and it also comprises three sections. These correspond to the three different assignation schemes in the previous stage. Here, the “fitting.Models” module stored in the “modules” folder was implemented. For CV1 and CV2, for each trait × model × replicate combination, the prediction of the missing values is conducted, and the results are stored in the “output” folder according to the different folds (1–5). The model fitting for CV0 and CV00 schemes was performed for each trait × model × replicate × fold combination and the resulting predicted values are stored in the “output” folder.
Finally, the 7th stage considers the computation of the variance components. For this, the same module as in the previous stage (“Fitting.Models”) was implemented for each trait × model combination by performing a full data analysis (i.e., no missing values were considered). Here, it is necessary to assign “−999” to the “folds” parameter in the “parameters.R” file. The resulting file, “fm_full.RData”, stored in the “output” folder, contains the obtained variance components among other objects.
5. Conclusions
GS is a widely adopted method in plant and animal breeding programs, and conceptually its implementation is easy to follow. It requires a set of genomic and phenotypic data for calibrating models that predict the performance of candidate genotypes in target environments. However, to achieve the highest correspondence between predicted and observed values, many factors should be assessed through cross-validation studies for a correct implementation of GS in real prediction problems.
For this reason, factors such as the quality control applied to marker covariates, suitable cross-validation schemes mimicking real prediction scenarios, and the choice of the prediction model, among others, should be evaluated. In most cases, the evaluation of these factors corresponds to minimal variations in the set of parameters evaluated in the pipeline. Thus, there is no need to restart the set of analyses from scratch when modifying the levels of the parameter(s) of interest: already developed code can be reused at different stages of the pipeline, and simple modifications suffice to adapt it to particular cases.
In this study, we provide a set of modules that can be easily assembled to build complex prediction pipelines in which the outputs of the early stages become the inputs of the more advanced ones. One feature of the proposed modules is that they can be used in a black-box fashion: the specifics of the different analyses are controlled with a parameter file, and there is no need to modify the main script (mainCode.R). With respect to the different cross-validation schemes, we provide a novel framework that allows similar sample sizes in calibration and prediction sets, such that the results of the different prediction scenarios can be directly contrasted.
Finally, with respect to the results obtained for both data sets, we confirm the advantages of accounting for the genotype-by-environment interaction in prediction models under the CV2 and CV1 cross-validation schemes. Under the CV0 scheme, mixed results were observed for the first data set (D1), while for D2 the main-effects model was slightly superior in most cases. Under the CV00 scheme, no significant differences were observed for either data set. We conclude that, with similar sample sizes in training sets, the genotype-by-environment interaction can be leveraged when a portion of the data in the target environment has been observed via other genotypes, while for novel environments other sources of information, such as soil and weather data, need to be incorporated to improve results.