1. Introduction
With the continuous growth of the global population and the uncertainties brought about by climate change, issues related to crop yield and food security are becoming increasingly prominent. Improving crop yield and the stability of desirable traits has become one of the important topics in contemporary crop breeding research [
1]. Traditional hybrid breeding relies on phenotypic observation to screen for individuals with stable, superior traits, a process that usually requires many years and multiple generations and is strongly influenced by the environment, making it difficult to distinguish true genetic differences [2]. Breeding value refers to the ability of an individual to pass its superior traits on to its offspring [3]. By evaluating breeding values during hybrid breeding, breeders can select and develop excellent new varieties more effectively, thereby accelerating the breeding process and improving breeding outcomes.
With the development of genetics, statistics, and molecular biology, scientists have developed various methods to evaluate breeding value. From the 1970s to the 1990s, Best Linear Unbiased Prediction (BLUP) and Marker-Assisted Selection (MAS) became effective methods for breeders to conduct genetic evaluations [
4]. However, these methods have limitations in the process of crop breeding improvement. For example, BLUP requires populations with pedigree information, and MAS has limited predictive power for quantitative traits [
5]. With the significant reduction in genotyping costs, genomic selection (GS), proposed by Meuwissen et al. in 2001 [
6], overcame the shortcomings of MAS by estimating individual breeding values from high-density markers across the genome. By exploiting genome-wide information to predict phenotypes, GS has gradually changed traditional crop breeding methods. Initially, linear models such as GBLUP and Bayesian methods were the most commonly used tools for phenotype prediction, but they considered only additive marker effects, leading to biased genetic estimates [7]. When faced with more complex breeding scenarios and high-throughput data, their predictive capabilities were limited, and their prediction accuracy struggled to meet the needs of crop breeding [
8].
In 2012, deep learning (DL) made significant progress in image recognition fields, which sparked the scientific community’s interest in its potential in genomic selection. DL is gradually becoming an important tool for phenotype prediction due to its capabilities in automatic feature learning, nonlinear modeling, high-dimensional data processing, and integrating multimodal data [
9]. Compared with traditional methods, DL can extract deep information from complex genomic and phenotypic data, improving the accuracy and efficiency of phenotype prediction [
10]. For example, the CNN-based DeepGS employs convolution, sampling, and dropout strategies to reduce data dimensionality; it takes genotype matrices as input and improves predictive performance by 1.44% to 65.24% over the RR-BLUP method, although it requires substantial model training time [11]. DNNGP uses multi-layer processing units to learn complex feature representations and has been evaluated on multiple plant breeding datasets, achieving a prediction accuracy of 79% for maize flowering date, though it carries a certain risk of overfitting when the training dataset is small [
12]. DeepCCR, based on CNN and bidirectional long short-term memory (BiLSTM), has a higher prediction accuracy for rice traits than four other models, including DNNGP [
13]. DeepCGP, based on data compressed by Autoencoder, achieves a maximum prediction accuracy of 99% for a specific rice trait but has relatively low prediction accuracy in other traits (e.g., tiller number and grain width) [
14].
Current research mainly focuses on constructing phenotype prediction models based on CNNs, with various improvements in prediction performance and computation time. However, these models are limited in capturing the complex relationships between genotype and phenotype, and there is still room for improvement in prediction accuracy. Therefore, this study proposes a new method, DeepAT, for predicting wheat yield from genomic data; it mainly consists of an input layer, a data feature extraction layer, a feature relationship capture layer, and an output layer. The data feature extraction layer reduces dimensionality and extracts high-dimensional features from the input data, while the feature relationship capture layer further captures the complex relationships among the resulting low-dimensional features through multi-head self-attention layers and feedforward neural networks. As a result, DeepAT has significant advantages in feature extraction, model convergence, and capturing feature relationships. Compared with three machine learning methods and six deep learning methods, DeepAT better captures the complex relationships between genotype and phenotype and achieves higher accuracy in predicting wheat yield. This method provides a new tool for crop phenotype prediction and brings novel insights for deep learning in genomic selection.
2. Materials and Methods
2.1. Phenotype and Genotype Data
The data used in this paper comes from Australian Grains Technologies (AGT) and includes wheat yield phenotypes along with a matching set of genotypic markers [
15]. The selected plants in this dataset are from early- and advanced-generation breeding lines of AGT's wheat breeding programs, planted in Australia in 2014, comprising a total of 10,375 breeding lines. Grain yield was measured at harvest, and the adjusted grain yield, consisting of de-regressed grain-yield BLUPs with the site mean added, was used as the phenotype value for model training. The genotype data consist of a set of high-quality whole-genome genetic markers spanning all 21 chromosomes of wheat. The alleles are encoded as (AA, AB, BB) = (1, 0, −1). The genetic marker matrix is defined as $M = [M_1, \ldots, M_p]$, where $p$ is the number of markers covering all 21 chromosomes of wheat [
16] (
Figure 1a–c).
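To make the encoding concrete, the short sketch below shows how raw allele calls could be mapped to the (AA, AB, BB) = (1, 0, −1) scheme. The marker names and calls are hypothetical, since the public AGT dataset is distributed already encoded.

```python
import numpy as np
import pandas as pd

# Hypothetical raw allele calls; the public AGT dataset ships pre-encoded.
calls = pd.DataFrame({
    "snp_001": ["AA", "AB", "BB"],
    "snp_002": ["BB", "AA", "AB"],
})

# Encode alleles as (AA, AB, BB) = (1, 0, -1), as described above.
encoding = {"AA": 1, "AB": 0, "BB": -1}
M = calls.replace(encoding).to_numpy(dtype=np.int8)  # rows: lines, columns: markers M_1..M_p
print(M)
# [[ 1 -1]
#  [ 0  1]
#  [-1  0]]
```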
The above-mentioned genotype and phenotype data are from public datasets and can be directly used as input data for the models. In this study, the dataset was divided into a training set and a testing set in a 9:1 ratio, with 90% of the genotype and phenotype data used to train all models, and 10% of the data with the phenotype information removed used as the testing set to validate the model’s predictive performance on different metrics (
Figure 1d).
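A minimal sketch of this 9:1 split using scikit-learn follows; X and y stand in for the marker matrix and adjusted yields, and the dummy marker count and random seed are our assumptions, not values from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(10375, 5000)).astype(np.float32)  # dummy marker matrix
y = rng.normal(loc=3.0, scale=0.5, size=10375).astype(np.float32)  # dummy adjusted yields

# 90% of genotype-phenotype pairs for training; the remaining 10% serve as
# the testing set, with phenotypes withheld at prediction time.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
print(X_train.shape, X_test.shape)  # (9337, 5000) (1038, 5000)
```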
2.2. DeepAT Method for Wheat Phenotype Prediction
2.2.1. Overall Framework of the Model
Previous research has proposed models such as DeepGS and DNNGP, which are deep learning models based on CNN. These models still have room for improvement in capturing nonlinear relationships and prediction accuracy. Based on this, we innovatively propose the DeepAT model to predict wheat yield based on genotype data.
The DeepAT model framework consists of four layers: the input layer, the data feature extraction layer, the feature relationship capture layer, and the output layer (
Figure 1g). The input layer takes the genetic marker matrix and the adjusted yield values, divided into a training set and a test set at a ratio of 9:1; the testing set, with phenotype values withheld, is used for yield prediction. The data feature extraction layer (
Figure 1e) utilizes Autoencoder’s efficient data encoding capabilities to perform dimensionality reduction and feature extraction on the input independent variables, generating a low-dimensional latent representation that extracts valid features from the dataset and is capable of handling complex, high-dimensional genotype data. The feature relationship capture layer (
Figure 1f), based on the low-dimensional features extracted from the previous layer, further captures the complex relationships and long-range dependencies between features using the self-attention mechanism of the Transformer, allowing for a comprehensive understanding of the interactions between SNP sites within the genome. The specific process is illustrated in
Figure 1g, where the data enter through the input layer and pass into the data feature extraction layer, which compresses and reconstructs the preprocessed data, reducing its dimensionality while extracting effective features. The low-dimensional features are then fed into the feature relationship capture layer and processed through multi-head self-attention layers and feedforward neural networks, ultimately outputting the predicted wheat yield values.
DeepAT not only enables a deep understanding and extraction of data features but also captures the complex relationships between genotypes and phenotypes. This enhances the model’s expressive power and predictive accuracy, giving it broad potential in genomic prediction.
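Since the exact layer sizes are not restated in the text, the following PyTorch sketch only mirrors the four-layer organization described above (input, autoencoder-based feature extraction, Transformer-based relationship capture, regression output). All dimensions, the number of heads and blocks, and the choice to treat the latent vector as a length-one token sequence are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class DeepATSketch(nn.Module):
    """Illustrative re-creation of the DeepAT layout; all sizes are assumptions."""

    def __init__(self, n_markers: int, latent_dim: int = 256,
                 n_heads: int = 4, n_blocks: int = 2):
        super().__init__()
        # Data feature extraction layer: the encoder half of the autoencoder.
        self.encoder = nn.Sequential(
            nn.Linear(n_markers, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.ReLU(),
        )
        # Feature relationship capture layer: stacked Transformer blocks
        # (multi-head self-attention, feedforward networks, residuals, layer norm).
        block = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=n_blocks)
        # Output layer: a regression head producing the predicted yield.
        self.head = nn.Linear(latent_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                   # (batch, latent_dim)
        z = self.transformer(z.unsqueeze(1))  # latent vector as a length-1 sequence
        return self.head(z.squeeze(1)).squeeze(-1)

model = DeepATSketch(n_markers=5000)
print(model(torch.randn(8, 5000)).shape)  # torch.Size([8])
```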
2.2.2. Data Feature Extraction Layer
The data feature extraction layer utilizes the data dimensionality reduction and feature extraction capabilities of Autoencoders, which can handle complex high-dimensional genotype data and extract effective features from the dataset. An Autoencoder is an unsupervised learning model typically composed of an encoder and a decoder. It is suitable for regression tasks that require extracting important features from high-dimensional data [
17]. The encoder part is designed with a multi-layer, fully connected neural network, which extracts effective information by gradually reducing the feature dimensions layer by layer. Each layer introduces a rectified linear unit (ReLU) activation function to enhance the expression capability of nonlinear features and accelerate the model’s convergence speed, ultimately outputting a low-dimensional latent variable to complete feature extraction. The design of the decoder, on the other hand, increases the number of neurons to reconstruct the original input, ensuring that the model can recover the key information of the input data. The main advantages are as follows:
(1) Feature extraction. Autoencoder can extract low-dimensional and representative feature vectors from high-dimensional SNP data, which is very effective in reducing the data dimensionality, removing noise, and capturing the underlying genetic structure [
18].
(2) Data dimensionality reduction. Autoencoder achieves data compression and reconstruction through encoders and decoders, which helps the model to handle high-dimensional and complex genotype data, reduce the computational complexity of the model, and achieve fast processing and analysis of the data while retaining as much useful information as possible, which helps to improve the generalization ability of the subsequent model.
The calculation formula for the loss function is as follows:
$$L(x, \hat{x}) = \lVert x - \hat{x} \rVert^2 + \lambda\, \Omega(W)$$
The formula for the encoder is as follows:
$$h = f(x) = \sigma(W_e x + b_e)$$
The formula for the decoder is as follows:
$$\hat{x} = g(h) = \sigma(W_d h + b_d)$$
where $x$ denotes the input data point; $f$ denotes the encoder function, which maps the input $x$ to the low-dimensional space; $g$ denotes the decoder function, which maps the low-dimensional representation back to the original data space; $\lambda$ is a hyperparameter that controls the degree of influence of the regularization term; $\Omega(W)$ is the regularization term to prevent overfitting; $W_e$ and $W_d$ are the weight matrices of the encoder and the decoder; $b_e$ and $b_d$ are the bias vectors of the encoder and the decoder, respectively; $\sigma$ is the ReLU activation function; and $\hat{x}$ is the reconstructed output.
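A minimal PyTorch autoencoder consistent with the equations above is sketched below: ReLU activations, layer-by-layer dimension reduction in the encoder, and a mirrored decoder. The layer sizes are assumptions, and the $\lambda$-weighted regularization term is approximated here via the optimizer's weight decay.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder following the equations above; layer sizes are assumptions."""

    def __init__(self, n_markers: int = 5000, latent_dim: int = 256):
        super().__init__()
        # Encoder: h = sigma(W_e x + b_e), shrinking dimensions layer by layer.
        self.encoder = nn.Sequential(
            nn.Linear(n_markers, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.ReLU(),
        )
        # Decoder: x_hat = g(h), mirroring the encoder; the final activation is
        # omitted in this sketch because the marker codes include negative values (-1).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_markers),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)          # low-dimensional latent representation
        return self.decoder(h), h    # reconstruction and extracted features

model = Autoencoder()
# weight_decay plays the role of the lambda-weighted regularization term.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
x = torch.randn(32, 5000)
x_hat, h = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction term ||x - x_hat||^2
loss.backward()
opt.step()
print(h.shape)  # torch.Size([32, 256])
```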
2.2.3. Feature Relationship Capture Layer
The feature relationship capture layer utilizes the feature learning capabilities of Transformers to capture the complex relationships between features, allowing for in-depth understanding and analysis of low-dimensional features. Transformer is a model based on an attention mechanism that was originally used for natural language processing tasks [
19], but its architecture is applicable to other types of data as well. The feature relationship capture layer is designed with multiple layers of Transformer blocks, each containing multi-head self-attention mechanisms, feedforward neural networks, residual connections, and layer normalization. The main advantages of this design are as follows:
(1) Capturing feature relationships. The Transformer utilizes a self-attention mechanism to directly capture the dependency relationships between any two positions in a sequence without distance limitations, allowing the model to effectively capture complex relationships between features from low-dimensional representations. Additionally, the Transformer retains positional information of elements in the sequence through positional encoding, helping the model understand the specific location of SNPs on chromosomes and their impact on phenotypes, which is beneficial for maintaining the importance of genomic structure.
(2) Model stability. The parallel computing capability of Transformers can reach a stable state more quickly during training compared to traditional RNN structures [
20], making the model training process more efficient and improving the model’s prediction efficiency and stability.
The core components of the feature relationship capture layer include multiple self-attention layers and feedforward neural networks. For a single self-attention layer, the computational formula can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
The $V$-values for each attention head are weighted and summed according to the attention scores:
$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right)$$
The outputs of all attention heads are concatenated to obtain the final multi-head attention output:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}$$
where $Q$, $K$, and $V$ denote the Query, Key, and Value matrices, respectively; $d_k$ denotes the dimension of the key vectors; and $W^{O}$ denotes the output weight matrix.
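These computations map directly onto PyTorch's built-in modules. The sketch below, with assumed dimensions, shows one Transformer block of the kind this layer stacks, plus the underlying multi-head attention called with Q = K = V (self-attention).

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 4  # assumed sizes; each head uses d_k = d_model / n_heads

# One Transformer block: multi-head self-attention, feedforward network,
# residual connections, and layer normalization.
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=512, batch_first=True)

# The attention module computes softmax(Q K^T / sqrt(d_k)) V per head,
# concatenates the heads, and applies the output projection W^O.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

tokens = torch.randn(8, 10, d_model)         # (batch, sequence length, features)
out, weights = attn(tokens, tokens, tokens)  # self-attention: Q = K = V
print(out.shape, weights.shape)              # [8, 10, 256] and [8, 10, 10]
print(block(tokens).shape)                   # [8, 10, 256]
```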
2.3. Operational Environment
To effectively run DeepAT, we designed a high-performance computing environment. First, in terms of hardware, we used machines with at least a 4-core Intel Core i7 processor, paired with NVIDIA GeForce RTX series GPUs that have at least 6 GB of video memory to accelerate the computations of deep learning models. Additionally, we equipped the setup with 32 GB of RAM and a fast SSD to ensure smooth data processing and model training. On the software side, we chose the stable Ubuntu 20.04 LTS operating system and Python 3.7.7 as the programming language, along with the PyTorch 1.13 deep learning framework to build the core of the model. Furthermore, we integrated libraries such as NumPy 1.17.3, Pandas 1.3.5, Matplotlib 3.1.2, Scikit-learn 1.0.2, and Hugging Face’s Transformers, which are responsible for data processing, analysis, visualization, model evaluation, and management of Transformer models, respectively. We used Anaconda to manage the Python environment and dependencies, creating and activating a virtual environment named “wheat_yield” through conda commands, where we installed all the necessary packages. Finally, we ensured the preprocessing and formatting of data to meet the input requirements of the model. In this environment, our model was efficiently trained and tested to achieve accurate predictions of wheat yield.
2.4. Methods Used for Comparison
To evaluate the predictive accuracy of DeepAT, three machine learning methods (Bayesian Regression, Random Forest, and Support Vector Machine, SVM) and six deep learning methods (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) were selected for performance comparison on the same wheat dataset, with identical input dimensions, to verify the performance advantages of DeepAT.
2.4.1. Machine Learning Methods
Bayesian regression is a statistical method for estimating the probability distribution of regression parameters using Bayes’ theorem [
21]. It can handle parameter uncertainty, integrate prior knowledge, and provide a distribution of parameters rather than a single estimate. This is particularly important when faced with a finite dataset, as shown in the following expression:
$$P(\theta \mid y, X) = \frac{P(y \mid X, \theta)\, P(\theta)}{P(y \mid X)}$$
where $P(\theta \mid y, X)$ is the posterior probability, i.e., the probability of the parameter $\theta$ given the data $y$ and the features $X$; $P(y \mid X, \theta)$ is the likelihood function, which represents the probability of the data $y$ given the features $X$ and the parameter $\theta$; and $P(\theta)$ is the prior probability distribution of the parameter $\theta$.
Random Forest is a decision tree-based machine learning algorithm that makes final predictions by constructing multiple decision trees and voting on their predictions [
22]. It is robust, can handle high-dimensional data, assesses the importance of features, and is not prone to overfitting [
23]. Its predictions can be expressed by majority voting (classification) or averaging (regression) of the output of individual decision trees:
$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t$$
where $\hat{y}_t$ denotes the prediction of the $t$-th tree and $T$ is the number of trees.
Support Vector Machine (SVM) is a supervised learning algorithm mainly used for classification and regression analysis. The core idea is to find a hyperplane in the feature space to maximize the margin between different classes [
24]. SVM can solve nonlinear problems via the kernel trick; it is suitable for high-dimensional spaces and is computationally efficient. The expression is as follows:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b$$
where $K(x_i, x)$ is the kernel function used to compute the similarity between two samples; $\alpha_i$ is the Lagrange multiplier, determined during training; $y_i$ is the category label of the $i$-th sample; $b$ is the bias term; and $n$ is the number of support vectors.
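For reference, all three baselines can be instantiated with scikit-learn; the sketch below uses library defaults or illustrative hyperparameters, not the settings used in the comparison, and dummy data in place of the wheat markers and yields.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 50)), rng.normal(size=200)  # dummy stand-ins for markers/yields

models = {
    "Bayesian Regression": BayesianRidge(),                  # posterior over weights
    "Random Forest": RandomForestRegressor(n_estimators=500, random_state=42),
    "SVM": SVR(kernel="rbf"),                                # kernel trick for nonlinearity
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict(X[:3]).round(3))
```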
2.4.2. Deep Learning Methods
Fully Connected Neural Network (Fcnn) is one of the most basic neural network architectures in which all neurons in each layer are connected to all neurons in the next layer. All layers are densely connected and usually contain multiple layers to achieve nonlinear mapping [
25]. This network structure is commonly used to process vectorized data as well as fixed-size inputs.
For an Fcnn with $l$ layers, the output of layer $l$ can be expressed as follows:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\left(z^{(l)}\right)$$
where $z^{(l)}$ is the weighted input of layer $l$ plus the bias; $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector of layer $l$, respectively; $\sigma$ is the activation function; and $a^{(l)}$ is the activation output of layer $l$.
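In PyTorch, these layer equations correspond to a stack of nn.Linear layers interleaved with an activation; the sizes below are illustrative.

```python
import torch
import torch.nn as nn

# Each Linear computes z = W a + b; ReLU applies sigma(z) elementwise.
fcnn = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),   # final layer outputs the predicted phenotype value
)
print(fcnn(torch.randn(8, 256)).shape)  # torch.Size([8, 1])
```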
Autoencoder_Fcnn uses Fcnn to process the features extracted by the autoencoder for final phenotype prediction. Autoencoder_Transformer_Fcnn adds Transformer to Autoencoder_Fcnn to improve the ability to capture complex relationships. Transformer_Fcnn uses Transformer directly to extract features and then makes phenotype predictions through Fcnn.
Convolutional Neural Network (Cnn) is mainly used to process data with a grid-like structure, such as images or time series; it can automatically learn local features from input data and construct high-level feature representations through multiple layers of abstraction. Long Short-Term Memory network (Lstm) is a special kind of Recurrent Neural Network that, by introducing memory units and gating mechanisms to store long-term information, can effectively capture long-term dependencies from time-series data in phenotype prediction. The attention mechanism imitates the way attention is allocated in human visual and cognitive processes, enabling the model to automatically focus on the most important parts of the input data and better capture features associated with specific phenotypes.
In the Cnn_Lstm architecture, the output of the Cnn layer is used as the input to the Lstm layer. Assuming the Cnn layer's output feature map has size H × W × C, where H and W are the height and width and C is the number of channels, these features are flattened or rearranged into a sequence for input into the Lstm [
26]. Cnn_Lstm_Attention enables the model to focus on the most important parts of the input sequence by incorporating the Attention Mechanism. In Cnn_Lstm_Attention, the attention mechanism is typically used in the output of the Lstm layer to generate a weighted context vector
c, which is computed based on the weights of the Lstm hidden states and the input sequence:
$$c = \sum_{i=1}^{T} \alpha_i h_i$$
where $h_i$ is the hidden state of the LSTM at time step $i$, and $\alpha_i$ is the attention weight, computed by an additional fully connected layer that takes $h_i$ as input and produces a weight score, which is then normalized into a probability distribution.
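A minimal sketch of this additive attention over LSTM hidden states, with assumed dimensions: a fully connected layer scores each hidden state, softmax normalizes the scores into weights $\alpha_i$, and the context vector is their weighted sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, steps, hidden = 8, 20, 64
h = torch.randn(batch, steps, hidden)  # stand-in for LSTM hidden states h_i

# A fully connected layer scores each hidden state; softmax turns the scores
# into weights alpha_i; the context vector is c = sum_i alpha_i * h_i.
score_layer = nn.Linear(hidden, 1)
scores = score_layer(h).squeeze(-1)            # (batch, steps)
alpha = F.softmax(scores, dim=1)               # attention weights per time step
c = torch.sum(alpha.unsqueeze(-1) * h, dim=1)  # weighted context vector
print(c.shape)  # torch.Size([8, 64])
```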
2.5. Model Validation Metrics
We divided the wheat dataset into a training set and a testing set at a 9:1 ratio and used data of identical dimensions to train and test all models. Performance was compared on the testing set, with phenotype data withheld, using three metrics: the Average Relative Error (ARE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (R). The prediction accuracy of a model is defined as 1 minus its ARE.
ARE is used to measure the relative error between the predicted value and the observed value, taking into account the proportion of the error relative to the true value. It is particularly suitable for comparing different magnitudes of data [
27]. The calculation formula is as follows:
$$\mathrm{ARE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{y_i}$$
MSE measures the average squared error between predicted and observed values; it is more sensitive to large prediction errors and therefore better captures model bias [28]. The calculation formula is as follows:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$
R is used to measure the strength and direction of the linear relationship between two variables. In prediction problems, it can be used to assess the degree of consistency between the predicted values and the observed values. The value of R lies in the range (−1, 1): the closer the value is to 1 or −1, the stronger the linear relationship; a positive sign indicates a positive correlation, a negative sign indicates a negative correlation, and 0 indicates no linear relationship. The calculation formula is as follows:
$$R = \frac{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)\left( \hat{y}_i - \bar{\hat{y}} \right)}{\sqrt{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2}\, \sqrt{\sum_{i=1}^{N} \left( \hat{y}_i - \bar{\hat{y}} \right)^2}}$$
where $N$ is the number of samples; $y_i$ and $\hat{y}_i$ are the actual and predicted values of the $i$-th sample, respectively; and $\bar{y}$ and $\bar{\hat{y}}$ are the averages of the actual and predicted values over all samples, respectively.
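The three metrics, and the derived prediction accuracy (1 − ARE), can be computed in a few lines of NumPy. This sketch assumes no observed value is zero (so ARE is defined) and uses made-up numbers for illustration.

```python
import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """ARE, MSE, and Pearson R as defined above (assumes no zero observed values)."""
    are = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
    mse = np.mean((y_true - y_pred) ** 2)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return are, mse, r

y_true = np.array([3.1, 2.8, 3.5, 3.0])  # illustrative observed yields
y_pred = np.array([3.0, 2.9, 3.4, 3.1])  # illustrative predicted yields
are, mse, r = evaluation_metrics(y_true, y_pred)
print(f"ARE={are:.4f}  MSE={mse:.4f}  R={r:.4f}  accuracy={1 - are:.4f}")
```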
3. Results
3.1. Comparison of Training Loss Variation Among Deep Learning Models
To better evaluate the performance of the DeepAT model, six typical deep learning models for phenotype prediction (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) were selected to compare training loss variation (
Figure 2). From the training loss curves, both Autoencoder_Fcnn and Autoencoder_Transformer_Fcnn exhibited relatively flat loss curves, indicating that they converge quickly in the early stages of training but may have limited expressiveness on complex tasks or face a risk of overfitting. In contrast, the loss curve of the Transformer model showed significant fluctuations, suggesting that this model continuously optimizes during training and possesses good generalization ability. Cnn_Lstm and Cnn_Lstm_Attention experienced a rapid decrease in loss in the early stages of training, followed by stabilization, reflecting their effectiveness in handling sequential data but also indicating limited room for optimization in the later stages of training. The loss curve of the DeepAT model decreased rapidly in the initial phase and fluctuated considerably in the later stages, indicating that the model converges quickly at first and continues to optimize throughout training, demonstrating efficient learning ability, although there is a risk of overfitting as training continues. Overall, DeepAT handles the data well, but the stability of long-term training could be improved through greater dataset diversity or more sophisticated optimization strategies.
3.2. Prediction Accuracy of DeepAT Compared with Other Methods
In order to test the accuracy and performance of DeepAT, comparative experiments were conducted by adding three currently popular machine learning models (Bayesian Regression, Random Forest, and SVM) to six typical deep learning models (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) (
Figure 3).
In terms of ARE, the machine learning methods produced larger errors for yield prediction than the deep learning methods, with Random Forest having the highest ARE at 10.46%. The AREs of Cnn_Lstm and Cnn_Lstm_Attention were relatively small, at 0.53% and 0.93%, respectively, while DeepAT had the smallest ARE at 0.02%. In terms of MSE, DeepAT had the lowest MSE (28.93), while the three deep learning methods combined with Fcnn (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, and Transformer_Fcnn) had larger MSEs, even exceeding those of the three machine learning methods.
In terms of prediction accuracy, compared with the three machine learning models (Bayesian Regression, Random Forest, and SVM), DeepAT's prediction accuracy was higher by 9.42%, 10.44%, and 9.97%, respectively. Compared with the six deep learning models (Transformer, Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Cnn_Lstm, and Cnn_Lstm_Attention), DeepAT's prediction accuracy was higher by 0.81%, 8.36%, 8.36%, 7.81%, 0.51%, and 0.91%, respectively. These results show that the deep learning approaches combined with Fcnn perform poorly for yield prediction, with lower prediction accuracy than the machine learning models. DeepAT outperforms all other models for yield prediction, achieving the highest accuracy in predicting wheat yield (99.98%), with an MSE of only 28.93 tonnes and an R close to 1.
3.3. Comparison of Correlation Between Yield Observed Values and Predicted Values
The Pearson correlation coefficient (R) can be used to assess the degree of consistency between predicted and observed values. In this study, we compared DeepAT with six typical deep learning models (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) and three machine learning models (Bayesian Regression, Random Forest, and SVM) by evaluating the R between predicted and observed yield values to assess the predictive performance of the models (
Figure 4).
First, the traditional machine learning methods (Bayesian Regression, Random Forest, and SVM) demonstrated some predictive capability, but their R values were relatively low, indicating a certain degree of deviation between predicted and observed yields. In contrast, the deep learning models, including Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer, Transformer_Fcnn, Cnn_Lstm, and Cnn_Lstm_Attention, exhibited stronger predictive ability; in particular, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention had higher R values, showing that their predicted yields were closer to the observed values. DeepAT performed best among all models, with an R approaching 1, indicating that its predicted yield values closely match the observed values. Therefore, DeepAT has significant advantages in phenotype prediction.
4. Discussion
With the increasing amount of omics data, the use of deep learning methods to enhance the predictive ability for complex traits in plant breeding has become a common research approach. In the process of utilizing deep learning to assist in crop genetic breeding, previous studies [
29,
30,
31,
32] have developed a series of phenotype prediction models that improve on various aspects, such as predictive performance and computation time. However, there is still room for improvement in capturing the complex relationships between genotype and phenotype, as well as in prediction accuracy. This study proposes a new deep learning-based phenotype prediction method called DeepAT, which mainly comprises an input layer, a data feature extraction layer, a feature relationship capture layer, and an output layer. Compared with three machine learning methods and six deep learning methods on a public wheat dataset, DeepAT outperforms all other methods and can predict wheat phenotype values from genotype data. Its innovations lie in the following areas:
(1) Feature extraction. DeepAT can extract representative feature vectors from high-dimensional SNP data. By introducing the ReLU activation function, it enhances the model’s ability to express nonlinear features, which helps accelerate the model’s convergence speed.
(2) Dimensionality reduction. DeepAT can handle high-dimensional complex genotype data, reduce computational complexity, and quickly process and analyze data while retaining as much useful information as possible, which helps improve the generalization ability of subsequent models.
(3) Capturing feature relationships. DeepAT effectively captures the complex relationships between features from low-dimensional features using a self-attention mechanism, allowing it to understand the specific locations of SNPs on chromosomes and their impact on phenotypes.
(4) Model stability. DeepAT utilizes the parallel computing capabilities of Transformers to make the model training process more efficient. Compared to traditional RNN structures, it can reach a stable state more quickly, enhancing training efficiency.
DeepAT, by combining the above advantages, can better capture the complex relationship between genotype and phenotype and is able to predict wheat phenotype values based on genotype data, demonstrating higher predictive accuracy than other methods. However, the public datasets used in this study are limited to a single crop species and lack multi-year and multi-site data, which restricts the model’s generalization ability. This may lead to a decline in the ability to make accurate predictions in new environments, affecting the stability of the model’s predictions. Nevertheless, DeepAT still has significant potential for crop phenotype prediction. Future research needs to increase the diversity of training datasets, introduce various crop species, provide data covering multiple environments and time periods, and adopt better algorithms for adjustment and optimization, thereby improving the model’s generalization ability and ensuring the stability of its predictions.
5. Conclusions
The phenotype prediction model DeepAT proposed in this study has significant advantages in predicting wheat yield, achieving a prediction accuracy of 99.98%, an MSE of only 28.93 tonnes, and a Pearson correlation coefficient close to 1, indicating that the predicted yield values closely match the observed values. However, yield is a complex quantitative trait [
33], and further analysis of multi-year, multi-site, and multi-species datasets is needed to enhance the model’s generalization ability and predictive performance to meet the prediction needs of different scales, crops, and traits.
In the future, the data and models used in this study can be directly integrated into an Online Analysis and Mining Platform for Agricultural Science (
http://47.106.253.187/#/, accessed on 16 November 2024) to facilitate the practical application of DeepAT. As an advanced deep learning framework, DeepAT provides an effective tool for phenotype prediction, bringing a new perspective to deep learning-assisted genomic selection, and holds great potential in intelligent breeding.