
DeepAT: A Deep Learning Wheat Phenotype Prediction Model Based on Genotype Data

1 National Nanfan Research Institute, Chinese Academy of Agricultural Sciences (CAAS), Sanya 572024, China
2 Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
3 National Agriculture Science Data Center, Beijing 100081, China
4 Nanjing Institute of Agricultural Mechanization, Ministry of Agriculture and Rural Affairs, Nanjing 210014, China
5 Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Agronomy 2024, 14(12), 2756; https://doi.org/10.3390/agronomy14122756
Submission received: 18 October 2024 / Revised: 19 November 2024 / Accepted: 20 November 2024 / Published: 21 November 2024
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Genomic selection is an effective approach in crop genetic breeding, capable of significantly shortening the breeding cycle and improving breeding accuracy. Phenotype prediction can help identify genetic variants associated with specific phenotypes, providing a data-driven selection criterion for genomic selection and making the selection process more efficient and targeted. Deep learning has become an important tool for phenotype prediction owing to its strengths in automatic feature learning, nonlinear modeling, and high-dimensional data processing. Current deep learning models have improved in various respects, such as predictive performance and computation time, but they remain limited in capturing the complex relationships between genotype and phenotype, leaving room for improvement in prediction accuracy. This study proposes a new method called DeepAT, which mainly comprises an input layer, a data feature extraction layer, a feature relationship capture layer, and an output layer. The method predicts wheat yield from genotype data and is innovative in four respects: (1) the data feature extraction layer extracts representative feature vectors from high-dimensional SNP data, and its ReLU activation function enhances the model's ability to express nonlinear features and accelerates convergence; (2) DeepAT can handle high-dimensional, complex genotype data while retaining as much useful information as possible; (3) the feature relationship capture layer effectively captures the complex relationships among the extracted low-dimensional features through a self-attention mechanism; (4) compared with traditional RNN structures, model training is more efficient and stable. In comparative experiments on a public wheat dataset from AGT against three machine learning and six deep learning methods, DeepAT exhibited the best predictive performance, achieving a prediction accuracy of 99.98%, a mean squared error (MSE) of only 28.93 tonnes, and a Pearson correlation coefficient close to 1, with predicted yields closely matching observed values. This method provides a new perspective for deep learning-assisted phenotype prediction and shows great potential in smart breeding.

1. Introduction

With the continuous growth of the global population and the uncertainties brought about by climate change, issues of crop yield and food security are becoming increasingly prominent. Improving crop yield and the stability of desirable traits has become one of the important topics in contemporary crop breeding research [1]. Traditional hybrid breeding relies on phenotypic observation to screen individuals with stable, excellent traits, a process that usually requires many years and multiple generations and is strongly influenced by the environment, making it difficult to distinguish true genetic differences [2]. Breeding value refers to the ability of an individual to pass its superior traits on to subsequent generations [3]. By evaluating breeding value during hybrid breeding, breeders can more effectively select and develop excellent new varieties, thereby accelerating the breeding process and improving breeding outcomes.
With the development of genetics, statistics, and molecular biology, scientists have developed various methods to evaluate breeding value. From the 1970s to the 1990s, Best Linear Unbiased Prediction (BLUP) and Marker-Assisted Selection (MAS) became effective methods for breeders to conduct genetic evaluations [4]. However, these methods have limitations in crop improvement: BLUP requires populations with pedigree information, and MAS has limited predictive power for quantitative traits [5]. With the significant reduction in genotyping costs, genomic selection (GS), proposed by Meuwissen et al. in 2001 [6], overcame the shortcomings of MAS by estimating individual breeding values from high-density markers across the genome, allowing genome-wide information to be used to predict phenotypes and gradually changing traditional crop breeding. Initially, linear models such as GBLUP and Bayesian methods were the common tools for phenotype prediction, but they consider only additive marker effects, leading to biased genetic estimates [7]. When faced with more complex breeding scenarios and high-throughput data, their predictive capability is limited, and their accuracy struggles to meet the needs of crop breeding [8].
In 2012, deep learning (DL) made significant progress in image recognition, which sparked the scientific community's interest in its potential for genomic selection. DL is gradually becoming an important tool for phenotype prediction owing to its capabilities in automatic feature learning, nonlinear modeling, high-dimensional data processing, and multimodal data integration [9]. Compared with traditional methods, DL can extract deep information from complex genomic and phenotypic data, improving the accuracy and efficiency of phenotype prediction [10]. For example, the CNN-based DeepGS employs convolution, sampling, and dropout strategies to reduce data dimensionality; it takes genotype matrices as input and improves phenotype prediction by 1.44% to 65.24% over the RR-BLUP method, but requires substantial model training time [11]. DNNGP uses multi-layer processing units to learn complex feature representations and was evaluated on multiple plant breeding datasets, achieving a prediction accuracy of 79% for maize flowering dates, but it carries a certain risk of overfitting when the training dataset is small [12]. DeepCCR, based on CNN and bidirectional long short-term memory (BiLSTM), predicts rice traits more accurately than four other models, including DNNGP [13]. DeepCGP, based on data compressed by an Autoencoder, achieves a maximum prediction accuracy of 99% for a specific rice trait but has relatively low accuracy for other traits (e.g., tiller number and grain width) [14].
Current research mainly focuses on constructing phenotype prediction models based on CNNs, with various improvements in prediction performance and computation time. However, these models are limited in capturing the complex relationships between genotype and phenotype, and there is still room for improvement in prediction accuracy. Therefore, this study proposes a new method called DeepAT for predicting wheat yield from genomic data, which mainly comprises an input layer, a data feature extraction layer, a feature relationship capture layer, and an output layer. The data feature extraction layer reduces dimensionality and extracts features from the high-dimensional input data, while the feature relationship capture layer further captures the complex relationships between the low-dimensional features through multi-head self-attention layers and feedforward neural networks. As a result, DeepAT has significant advantages in feature extraction, model convergence, and capturing feature relationships. Compared with three machine learning methods and six deep learning methods, DeepAT better captures the complex relationships between genotype and phenotype, achieving higher accuracy in predicting wheat yield. This method provides a new tool for crop phenotype prediction and brings novel insights for deep learning in genomic selection.

2. Materials and Methods

2.1. Phenotype and Genotype Data

The data used in this paper come from Australian Grains Technologies (AGT) and include wheat yield phenotypes along with a matching set of genotypic markers [15]. The plants in this dataset are from early- and advanced-generation breeding lines of AGT's wheat breeding programs, planted in Australia in 2014, comprising a total of 10,375 breeding lines. Grain yield was measured at harvest. The adjusted grain yield, consisting of grain yield de-regressed BLUPs with the site mean added, was used as the phenotype input for model training. The genotype data consist of high-quality whole-genome genetic markers spanning all 21 chromosomes of wheat. Alleles are encoded as (AA, AB, BB) = (1, 0, −1). The genetic marker matrix is defined as $M = [M_1, \ldots, M_p]$, where $p$ is the number of markers covering all 21 wheat chromosomes [16] (Figure 1a–c).
The above-mentioned genotype and phenotype data are from public datasets and can be directly used as input data for the models. In this study, the dataset was divided into a training set and a testing set in a 9:1 ratio, with 90% of the genotype and phenotype data used to train all models, and 10% of the data with the phenotype information removed used as the testing set to validate the model’s predictive performance on different metrics (Figure 1d).
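This split can be sketched in a few lines of Python. The file names and loading steps below are hypothetical placeholders for illustration, not the authors' original pipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Marker matrix M: one row per breeding line, alleles coded (AA, AB, BB) = (1, 0, -1)
M = pd.read_csv("agt_genotypes.csv", index_col=0).values               # hypothetical file
y = pd.read_csv("agt_adjusted_yield.csv", index_col=0).values.ravel()  # hypothetical file

# 9:1 train/test split; the held-out 10% is scored with phenotype labels withheld
X_train, X_test, y_train, y_test = train_test_split(
    M, y, test_size=0.1, random_state=42
)
```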

2.2. DeepAT Method for Wheat Phenotype Prediction

2.2.1. Overall Framework of the Model

Previous research has proposed CNN-based deep learning models such as DeepGS and DNNGP. These models still have room for improvement in capturing nonlinear relationships and in prediction accuracy. We therefore propose the DeepAT model to predict wheat yield from genotype data.
The DeepAT framework consists of four layers: the input layer, the data feature extraction layer, the feature relationship capture layer, and the output layer (Figure 1g). The input layer takes the genetic marker matrix and adjusted yield values, divided into a training set and a test set in a 9:1 ratio; the testing set, with phenotype values withheld, is used for yield prediction. The data feature extraction layer (Figure 1e) uses the efficient data encoding capability of an Autoencoder to perform dimensionality reduction and feature extraction on the input variables, generating a low-dimensional latent representation that captures valid features from the dataset and can handle complex, high-dimensional genotype data. The feature relationship capture layer (Figure 1f) takes the low-dimensional features from the previous layer and further captures the complex relationships and long-range dependencies between them using the Transformer's self-attention mechanism, allowing a comprehensive understanding of the interactions between SNP sites within the genome. The overall process is illustrated in Figure 1g: data enter through the input layer into the data feature extraction layer, which compresses and reconstructs the preprocessed data, reducing its dimensionality while extracting effective features; the low-dimensional features are then passed to the feature relationship capture layer and processed through multi-head self-attention layers and feedforward neural networks, ultimately outputting the predicted wheat yield.
DeepAT not only enables a deep understanding and extraction of data features but also captures the complex relationships between genotypes and phenotypes. This enhances the model’s expressive power and predictive accuracy, giving it broad potential in genomic prediction.
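To make the four-layer pipeline concrete, the following PyTorch sketch composes an encoder, a Transformer stack, and a regression head in the order just described. It is a minimal sketch under stated assumptions: the layer widths, head count, block depth, and the treatment of the latent vector as a length-one sequence are illustrative choices, not the reported DeepAT configuration.

```python
import torch
import torch.nn as nn

class DeepATSketch(nn.Module):
    def __init__(self, n_markers, latent_dim=128, n_heads=4, n_blocks=2):
        super().__init__()
        # Data feature extraction layer: encoder half of an autoencoder
        self.encoder = nn.Sequential(
            nn.Linear(n_markers, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.ReLU(),
        )
        # Feature relationship capture layer: stacked Transformer blocks with
        # multi-head self-attention and feedforward sublayers
        block = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(block, num_layers=n_blocks)
        # Output layer: regression head producing the predicted yield
        self.head = nn.Linear(latent_dim, 1)

    def forward(self, x):                        # x: (batch, n_markers)
        z = self.encoder(x)                      # (batch, latent_dim)
        z = self.transformer(z.unsqueeze(1))     # latent vector as a length-1 sequence
        return self.head(z.squeeze(1)).squeeze(-1)
```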

2.2.2. Data Feature Extraction Layer

The data feature extraction layer utilizes the dimensionality reduction and feature extraction capabilities of Autoencoders, which can handle complex high-dimensional genotype data and extract effective features from the dataset. An Autoencoder is an unsupervised learning model typically composed of an encoder and a decoder, suitable for regression tasks that require extracting important features from high-dimensional data [17]. The encoder is designed as a multi-layer fully connected neural network that extracts effective information by reducing the feature dimensionality layer by layer. Each layer introduces a rectified linear unit (ReLU) activation function to enhance the expression of nonlinear features and accelerate convergence, ultimately outputting a low-dimensional latent variable to complete feature extraction. The decoder, conversely, increases the number of neurons layer by layer to reconstruct the original input, ensuring that the model can recover the key information of the input data. The main advantages are as follows:
(1) Feature extraction. The Autoencoder extracts low-dimensional, representative feature vectors from high-dimensional SNP data, which is highly effective for reducing dimensionality, removing noise, and capturing the underlying genetic structure [18].
(2) Data dimensionality reduction. The Autoencoder compresses and reconstructs data through its encoder and decoder, enabling the model to handle high-dimensional, complex genotype data, reducing computational complexity, and allowing fast processing and analysis while retaining as much useful information as possible, which improves the generalization ability of subsequent models.
The loss function is calculated as follows:
$$L_{AE}(x_i; \theta) = \lVert x_i - g(f(x_i)) \rVert^2 + \lambda R(\theta)$$
The formula for the encoder is as follows:
$$f(x_i) = \sigma(W_e x_i + b_e)$$
The formula for the decoder is as follows:
$$\hat{x} = g(f(x_i)) = \sigma(W_d f(x_i) + b_d)$$
where $x_i$ denotes the input data point; $f$ denotes the encoder function, which maps the input to the low-dimensional space; $g$ denotes the decoder function, which maps the low-dimensional representation back to the original data space; $\lambda$ is a hyperparameter controlling the influence of the regularization term; $R(\theta)$ is the regularization term that prevents overfitting; $W_e$ and $W_d$ are the weight matrices of the encoder and decoder; $b_e$ and $b_d$ are the bias vectors of the encoder and decoder, respectively; $\sigma$ is the ReLU activation function; and $\hat{x}$ is the reconstructed output.
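A minimal PyTorch realization of this encoder–decoder design might look as follows. The layer sizes and marker count are illustrative assumptions, and the regularization term $\lambda R(\theta)$ is approximated here by the optimizer's weight decay:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_markers, latent_dim=128):
        super().__init__()
        # Encoder f: reduces dimensionality layer by layer, ReLU after each layer
        self.encoder = nn.Sequential(
            nn.Linear(n_markers, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.ReLU(),
        )
        # Decoder g: mirrors the encoder to reconstruct the original input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_markers),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(n_markers=17000)   # marker count is an assumption
# Reconstruction loss ||x - g(f(x))||^2; weight_decay plays the role of lambda*R(theta)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```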

2.2.3. Feature Relationship Capture Layer

The feature relationship capture layer utilizes the feature learning capabilities of Transformers to capture the complex relationships between features, allowing for in-depth understanding and analysis of low-dimensional features. Transformer is a model based on an attention mechanism that was originally used for natural language processing tasks [19], but its architecture is applicable to other types of data as well. The feature relationship capture layer is designed with multiple layers of Transformer blocks, each containing multi-head self-attention mechanisms, feedforward neural networks, residual connections, and layer normalization. The main advantages of this design are as follows:
(1) Capturing feature relationships. The Transformer utilizes a self-attention mechanism to directly capture the dependency relationships between any two positions in a sequence without distance limitations, allowing the model to effectively capture complex relationships between features from low-dimensional representations. Additionally, the Transformer retains the positional information of elements in the sequence through positional encoding, helping the model understand the specific locations of SNPs on chromosomes and their impact on phenotypes, which is beneficial for preserving information about genomic structure.
(2) Model stability. Thanks to their parallel computing capability, Transformers reach a stable state more quickly during training than traditional RNN structures [20], making the training process more efficient and improving the model's prediction efficiency and stability.
The core components of the feature relationship capture layer include multiple self-attention layers and feedforward neural networks. For a single self-attention layer, the computational formula can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
The values for each attention head are weighted and summed according to the attention scores:
$$\mathrm{head}_i = \mathrm{AttentionScore}_i \, V_i$$
The outputs of all attention heads are concatenated to obtain the final multi-head attention output:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_O$$
where $Q$, $K$, and $V$ denote the Query, Key, and Value matrices, respectively; $d_k$ denotes the dimension of the key vectors; and $W_O$ denotes the output weight matrix.
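For illustration, the scaled dot-product attention above can be written from scratch in a few lines; this is a didactic sketch, and a library layer such as PyTorch's nn.MultiheadAttention performs the same computation with the head splitting, concatenation, and output projection $W_O$ handled internally:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: tensors of shape (batch, heads, seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # attention scores
    return weights @ V                                 # weighted sum of the values
```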

2.3. Operational Environment

To effectively run DeepAT, we designed a high-performance computing environment. First, in terms of hardware, we used machines with at least a 4-core Intel Core i7 processor, paired with NVIDIA GeForce RTX series GPUs that have at least 6 GB of video memory to accelerate the computations of deep learning models. Additionally, we equipped the setup with 32 GB of RAM and a fast SSD to ensure smooth data processing and model training. On the software side, we chose the stable Ubuntu 20.04 LTS operating system and Python 3.7.7 as the programming language, along with the PyTorch 1.13 deep learning framework to build the core of the model. Furthermore, we integrated libraries such as NumPy 1.17.3, Pandas 1.3.5, Matplotlib 3.1.2, Scikit-learn 1.0.2, and Hugging Face’s Transformers, which are responsible for data processing, analysis, visualization, model evaluation, and management of Transformer models, respectively. We used Anaconda to manage the Python environment and dependencies, creating and activating a virtual environment named “wheat_yield” through conda commands, where we installed all the necessary packages. Finally, we ensured the preprocessing and formatting of data to meet the input requirements of the model. In this environment, our model was efficiently trained and tested to achieve accurate predictions of wheat yield.

2.4. Methods Used for Comparison

To evaluate the predictive accuracy of DeepAT, three machine learning methods (Bayesian Regression, Random Forest, and Support Vector Machine (SVM)) and six deep learning methods (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) were selected for performance comparison on the same wheat dataset to verify the performance advantages of DeepAT.

2.4.1. Machine Learning Methods

Bayesian regression is a statistical method for estimating the probability distribution of regression parameters using Bayes’ theorem [21]. It can handle parameter uncertainty, integrate prior knowledge, and provide a distribution of parameters rather than a single estimate. This is particularly important when faced with a finite dataset, as shown in the following expression:
$$p(\theta \mid y, X) \propto p(y \mid X, \theta)\, p(\theta)$$
where $p(\theta \mid y, X)$ is the posterior probability, i.e., the probability of the parameter $\theta$ given the data $y$ and the features $X$; $p(y \mid X, \theta)$ is the likelihood function, representing the probability of the data $y$ given the features $X$ and the parameter $\theta$; and $p(\theta)$ is the prior probability distribution of the parameter $\theta$.
Random Forest is a decision tree-based machine learning algorithm that makes final predictions by constructing multiple decision trees and aggregating their predictions [22]. It is robust, can handle high-dimensional data, assesses the importance of features, and is not prone to overfitting [23]. Its predictions are obtained by majority voting (classification) or averaging (regression) over the outputs of the individual decision trees:
$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} f_t(x)$$
where $f_t(x)$ denotes the prediction of the $t$-th tree and $T$ is the number of trees.
Support Vector Machine (SVM) is a supervised learning algorithm mainly used for classification and regression analysis. The core idea is to find a hyperplane in the feature space that maximizes the margin between classes [24]. SVM can solve nonlinear problems via the kernel trick; it is suitable for high-dimensional spaces and is computationally efficient. The expression is as follows:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b$$
where $K(x_i, x)$ is the kernel function computing the similarity between two samples; $\alpha_i$ is a Lagrange multiplier determined during training; $y_i$ is the class label of sample $i$; $b$ is the bias term; and $n$ is the number of support vectors.
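The three baselines can be instantiated with scikit-learn as sketched below. The hyperparameters shown are library defaults or illustrative choices rather than the exact experimental settings, and X_train, y_train, and X_test are assumed to come from the 9:1 split of Section 2.1:

```python
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

models = {
    "Bayesian Regression": BayesianRidge(),
    "Random Forest": RandomForestRegressor(n_estimators=500, n_jobs=-1),
    "SVM": SVR(kernel="rbf"),  # kernel trick for nonlinear relationships
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```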

2.4.2. Deep Learning Methods

A Fully Connected Neural Network (Fcnn) is one of the most basic neural network architectures, in which every neuron in each layer is connected to every neuron in the next layer. Such networks are densely connected and usually stack multiple layers to achieve nonlinear mappings [25]. This structure is commonly used to process vectorized data and fixed-size inputs.
For an Fcnn with $L$ layers, the output of layer $l$ can be expressed as follows:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$
$$a^{(l)} = \sigma(z^{(l)})$$
where $z^{(l)}$ is the weighted input of layer $l$ plus the bias; $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector of layer $l$, respectively; $\sigma$ is the activation function; and $a^{(l)}$ is the activation output of layer $l$.
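As a concrete example, a small Fcnn regression head implementing these layer-wise formulas can be written in PyTorch as follows; the input dimension and layer widths are illustrative assumptions:

```python
import torch.nn as nn

# Each Linear + ReLU pair computes a^(l) = sigma(W^(l) a^(l-1) + b^(l))
fcnn = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),   # final layer outputs the predicted phenotype value
)
```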
Autoencoder_Fcnn uses Fcnn to process the features extracted by the autoencoder for final phenotype prediction. Autoencoder_Transformer_Fcnn adds Transformer to Autoencoder_Fcnn to improve the ability to capture complex relationships. Transformer_Fcnn uses Transformer directly to extract features and then makes phenotype predictions through Fcnn.
A Convolutional Neural Network (Cnn) is mainly used to process data with a grid structure, such as images or time series; it automatically learns local features from the input data and constructs high-level feature representations through multiple layers of abstraction. A Long Short-Term Memory network (Lstm) is a special kind of recurrent neural network that, by introducing memory units and gating mechanisms to store long-term information, can effectively capture long-term dependencies from sequential data in phenotype prediction. The attention mechanism imitates the way attention is allocated in human visual and cognitive processes, enabling the model to automatically focus on the important parts of the input data and better capture features associated with specific phenotypes.
In the Cnn_Lstm architecture, the output of the Cnn layers serves as the input to the Lstm layer. Assuming the Cnn output feature map has size H × W × C, where H and W are the height and width and C is the number of channels, these features are flattened or rearranged into a sequence for input into the Lstm [26]. Cnn_Lstm_Attention enables the model to focus on the most important parts of the input sequence by incorporating an attention mechanism, typically applied to the output of the Lstm layer to generate a weighted context vector $c$, computed from the Lstm hidden states and their attention weights:
$$c = \sum_{i=1}^{N} \alpha_i h_i$$
where $h_i$ is the hidden state of the Lstm at time step $i$, and $\alpha_i$ is the attention weight, computed by an additional fully connected layer that takes $h_i$ as input and produces a score, which is then normalized into a probability distribution.
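A schematic PyTorch sketch of such a Cnn_Lstm_Attention model is given below; all dimensions and layer choices are illustrative assumptions rather than the configuration used in the experiments:

```python
import torch
import torch.nn as nn

class CnnLstmAttention(nn.Module):
    def __init__(self, n_markers, hidden=64):
        super().__init__()
        # Cnn front end: extracts local features along the marker sequence
        self.cnn = nn.Sequential(nn.Conv1d(1, 16, kernel_size=5, stride=2), nn.ReLU())
        self.lstm = nn.LSTM(input_size=16, hidden_size=hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # scores each hidden state h_i
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                           # x: (batch, n_markers)
        f = self.cnn(x.unsqueeze(1))                # (batch, 16, L)
        h, _ = self.lstm(f.transpose(1, 2))         # (batch, L, hidden)
        alpha = torch.softmax(self.attn(h), dim=1)  # attention weights alpha_i
        c = (alpha * h).sum(dim=1)                  # context vector c = sum_i alpha_i h_i
        return self.head(c).squeeze(-1)
```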

2.5. Model Validation Metrics

We divided the wheat dataset into a training set and a testing set in a 9:1 ratio and used the same data dimensions to train and test all models. Performance was compared on the testing set, with phenotype data withheld, using three metrics: the Average Relative Error (ARE), the Mean Squared Error (MSE), and the Pearson Correlation Coefficient (R). The prediction accuracy of a model is defined as 1 minus its ARE.
ARE measures the relative error between predicted and observed values, taking into account the proportion of the error relative to the true value; it is particularly suitable for comparing data of different magnitudes [27]. The calculation formula is as follows:
$$\mathrm{ARE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{x_i - \hat{x}_i}{x_i} \right| \times 100\%$$
MSE measures the average squared error between predicted and observed values; it is more sensitive to large prediction errors and therefore better captures model bias [28]. It is calculated as follows:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2$$
R measures the strength and direction of the linear relationship between two variables. In prediction problems, it can be used to assess the consistency between predicted and observed values. R lies in the range [−1, 1]; the closer its absolute value is to 1, the stronger the linear relationship. A positive sign indicates a positive correlation, a negative sign a negative correlation, and 0 no linear relationship. The calculation formula is as follows:
$$R = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(\hat{x}_i - \bar{\hat{x}})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (\hat{x}_i - \bar{\hat{x}})^2}}$$
where $N$ is the number of samples; $x_i$ and $\hat{x}_i$ are the observed and predicted values of the $i$-th sample, respectively; and $\bar{x}$ and $\bar{\hat{x}}$ are the means of the observed and predicted values over all samples.
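These three metrics are straightforward to implement; the NumPy functions below are a direct transcription of the formulas above:

```python
import numpy as np

def are(y_true, y_pred):
    """Average Relative Error in percent; prediction accuracy = 100% - ARE."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def mse(y_true, y_pred):
    """Mean Squared Error between observed and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient R."""
    return np.corrcoef(y_true, y_pred)[0, 1]
```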

3. Results

3.1. Comparison of Training Loss Variation Among Deep Learning Models

To better evaluate the performance of the DeepAT model, six typical deep learning models for phenotype prediction (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) were selected to compare training loss variation (Figure 2). From the training loss curves, both Autoencoder_Fcnn and Autoencoder_Transformer_Fcnn exhibited relatively flat loss curves, indicating that they converge quickly in the early stages of training but may have limited expressiveness on complex tasks or face a risk of overfitting. In contrast, the loss curve of the Transformer model fluctuated significantly, suggesting that this model keeps optimizing during training and possesses good generalization ability. Cnn_Lstm and Cnn_Lstm_Attention showed a rapid decrease in loss early in training followed by stabilization, reflecting their effectiveness on sequential data but also indicating limited room for optimization later in training. The loss curve of the DeepAT model decreased rapidly in the initial phase and fluctuated considerably in later stages, indicating that the model converges quickly at first and continues to optimize during training, demonstrating efficient learning, though with a risk of overfitting as training continues. Overall, DeepAT handles the data well, but the stability of long-term training needs to be improved through more diverse datasets or more sophisticated optimization strategies.

3.2. Prediction Accuracy of DeepAT Compared with Other Methods

To test the accuracy and performance of DeepAT, comparative experiments were conducted against three currently popular machine learning models (Bayesian Regression, Random Forest, and SVM) in addition to the six typical deep learning models (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) (Figure 3).
For ARE, yield prediction with machine learning shows larger errors than with deep learning; Random Forest has the highest ARE at 10.46%. The AREs of Cnn_Lstm and Cnn_Lstm_Attention are relatively small, at 0.53% and 0.93%, respectively, while DeepAT has the smallest ARE at 0.02%. In terms of MSE, DeepAT has the lowest value (28.93), while the three deep learning methods combined with Fcnn (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn) have larger MSEs, even exceeding those of the three machine learning methods.
For prediction accuracy, compared with the three machine learning models (Bayesian Regression, Random Forest, and SVM), DeepAT's prediction accuracy is higher by 9.42%, 10.44%, and 9.97%, respectively. Compared with the six deep learning models (Transformer, Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Cnn_Lstm, and Cnn_Lstm_Attention), it is higher by 0.81%, 8.36%, 8.36%, 7.81%, 0.51%, and 0.91%, respectively. The results show that the deep learning approaches combined with Fcnn perform poorly for yield prediction, with lower accuracy than the machine learning models. DeepAT outperforms all other models, achieving the highest accuracy in predicting wheat yield (99.98%), an MSE of only 28.93 tonnes, and an R close to 1.

3.3. Comparison of the Correlation Between Observed and Predicted Yield Values

The Pearson correlation coefficient (R) assesses the consistency between predicted and observed values. In this study, we compared DeepAT with the six typical deep learning models (Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer_Fcnn, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention) and the three machine learning models (Bayesian Regression, Random Forest, and SVM) by evaluating the R between predicted and observed yield values (Figure 4).
First, the traditional machine learning methods, such as Bayesian Regression, Random Forest, and SVM, demonstrated some predictive capability, but their R values were relatively low, indicating a degree of deviation between predicted and observed yields. In contrast, the deep learning models, including Autoencoder_Fcnn, Autoencoder_Transformer_Fcnn, Transformer, Transformer_Fcnn, Cnn_Lstm, and Cnn_Lstm_Attention, exhibited stronger predictive ability; in particular, Transformer, Cnn_Lstm, and Cnn_Lstm_Attention had higher R values, showing predicted yields closer to the observed values. DeepAT performed best among all models, with an R approaching 1, indicating that its predicted yields closely match the observations. DeepAT therefore has significant advantages in phenotype prediction.

4. Discussion

With the increasing amount of omics data, using deep learning to enhance the prediction of complex traits has become a common research approach in plant breeding. In applying deep learning to crop genetic breeding, previous studies [29,30,31,32] have developed a series of phenotype prediction models that improve on various aspects such as predictive performance and computation time. However, there is still room for improvement in capturing the complex relationships between genotype and phenotype and in prediction accuracy. This study proposes a new deep learning-based phenotype prediction method called DeepAT, which mainly comprises an input layer, a data feature extraction layer, a feature relationship capture layer, and an output layer. Compared with three machine learning methods and six deep learning methods on a public wheat dataset, DeepAT outperforms all of them and can predict wheat phenotype values from genotype data. Its innovations lie in the following areas:
(1) Feature extraction. DeepAT can extract representative feature vectors from high-dimensional SNP data. By introducing the ReLU activation function, it enhances the model’s ability to express nonlinear features, which helps accelerate the model’s convergence speed.
(2) Dimensionality reduction. DeepAT can handle high-dimensional complex genotype data, reduce computational complexity, and quickly process and analyze data while retaining as much useful information as possible, which helps improve the generalization ability of subsequent models.
(3) Capturing feature relationships. DeepAT effectively captures the complex relationships between features from low-dimensional features using a self-attention mechanism, allowing it to understand the specific locations of SNPs on chromosomes and their impact on phenotypes.
(4) Model stability. DeepAT utilizes the parallel computing capabilities of Transformers to make the model training process more efficient. Compared to traditional RNN structures, it can reach a stable state more quickly, enhancing training efficiency.
DeepAT, by combining the above advantages, can better capture the complex relationship between genotype and phenotype and is able to predict wheat phenotype values based on genotype data, demonstrating higher predictive accuracy than other methods. However, the public datasets used in this study are limited to a single crop species and lack multi-year and multi-site data, which restricts the model’s generalization ability. This may lead to a decline in the ability to make accurate predictions in new environments, affecting the stability of the model’s predictions. Nevertheless, DeepAT still has significant potential for crop phenotype prediction. Future research needs to increase the diversity of training datasets, introduce various crop species, provide data covering multiple environments and time periods, and adopt better algorithms for adjustment and optimization, thereby improving the model’s generalization ability and ensuring the stability of its predictions.

5. Conclusions

The phenotype prediction model DeepAT proposed in this study has significant advantages in predicting wheat yield, achieving a prediction accuracy of 99.98%, an MSE of only 28.93 tonnes, and a Pearson correlation coefficient close to 1, indicating that predicted yields closely match observed values. However, yield is a complex quantitative trait [33], and further analysis of multi-year, multi-site, and multi-species datasets is needed to enhance the model's generalization ability and predictive performance to meet the prediction needs of different scales, crops, and traits.
In the future, the data and models used in this study can be directly integrated into an Online Analysis and Mining Platform for Agricultural Science (http://47.106.253.187/#/, accessed on 16 November 2024) to facilitate the practical application of DeepAT. As an advanced deep learning framework, DeepAT provides an effective tool for phenotype prediction, bringing a new perspective to deep learning-assisted genomic selection, and holds great potential in intelligent breeding.

Author Contributions

Conceptualization, J.L. and J.Z.; methodology, Z.H.; software, Z.H.; validation, Z.H.; formal analysis, Z.H.; investigation, J.L.; resources, J.L.; data curation, J.L. and Z.H.; writing—original draft preparation, J.L. and Z.H.; writing—review and editing, J.Z. and S.Y.; visualization, J.L.; supervision, G.Z., S.Y., and J.Z.; project administration, J.Z.; funding acquisition, G.Z. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Sanya Yazhou Bay Science and Technology City Science and Technology Special Funding (No.SCKJ-JYRC-2023-45); National Key R&D Programme (No.2022YFF0711805, 2022YFF0711801); National Nanfan Research Institute of Chinese Academy of Agriculture Science Southern Propagation Special Project (No.YBXM2409, YBXM2410, YBXM2312, ZDXM2311); Special Project for Basic Research Operating Costs of Central Public Welfare Research Institutes (No.JBYW-AII-2024-05, JBYW-AII-2023-06); the Innovation Project of Chinese Academy of Agricultural Sciences (No.CAAS-ASTIP-2024-AII, CAAS-ASTIP-2023-AII).

Data Availability Statement

The public datasets used in the article were obtained from https://doi.org/10.25909/23949333.v1.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Farooq, M.A.; Gao, S.; Hassan, M.A.; Huang, Z.; Rasheed, A.; Hearne, S.; Prasanna, B.; Li, X.; Li, H. Artificial intelligence in plant breeding. Trends Genet. 2024, 40, 891–908.
2. Wang, X.; Xu, Y.; Hu, Z.; Xu, C. Genomic selection methods for crop improvement: Current status and prospects. Crop J. 2018, 6, 330–340.
3. Ahmar, S.; Gill, R.A.; Jung, K.H.; Faheem, A.; Qasim, M.U.; Mubeen, M.; Zhou, W. Conventional and molecular techniques from simple breeding to speed breeding in crop plants: Recent advances and future outlook. Int. J. Mol. Sci. 2020, 21, 2590.
4. Henderson, C.R. Best linear unbiased estimation and prediction under a selection model. Biometrics 1975, 31, 423–447.
5. Kage, U.; Kumar, A.; Dhokane, D.; Karre, S.; Kushalappa, A.C. Functional molecular markers for crop improvement. Crit. Rev. Biotechnol. 2016, 36, 917–930.
6. Meuwissen, T.H.; Hayes, B.J.; Goddard, M.E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 2001, 157, 1819–1829.
7. Alemu, A.; Åstrand, J.; Montesinos-López, O.A.; Sánchez, J.I.Y.; Fernández-Gónzalez, J.; Tadesse, W.; Vetukuri, R.R.; Carlsson, A.S.; Ceplitis, A.; Crossa, J.; et al. Genomic selection in plant breeding: Key factors shaping two decades of progress. Mol. Plant 2024, 17, 552–578.
8. Parveen, R.; Kumar, M.; Swapnil, N.; Singh, D.; Shahani, M.; Imam, Z.; Sahoo, J.P. Understanding the genomic selection for crop improvement: Current progress and future prospects. Mol. Genet. Genom. 2023, 298, 813–821.
9. Montesinos-López, O.A.; Montesinos-López, A.; Pérez-Rodríguez, P.; Barrón-López, J.A.; Martini, J.W.R.; Fajardo-Flores, S.B.; Gaytan-Lugo, L.S.; Santana-Mancilla, P.C.; Crossa, J. A review of deep learning applications for genomic selection. BMC Genom. 2021, 22, 19.
10. Liu, J.; Li, J.; Wang, H.; Yan, J. Application of deep learning in genomics. Sci. China Life Sci. 2020, 63, 1860–1878.
11. Ma, W.; Qiu, Z.; Song, J.; Cheng, Q.; Ma, C. DeepGS: Predicting phenotypes from genotypes using Deep Learning. BioRxiv 2017, 241414.
12. Wang, K.; Abid, M.A.; Rasheed, A.; Crossa, J.; Hearne, S.; Li, H. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol. Plant 2023, 16, 279–293.
13. Ma, X.; Wang, H.; Wu, S.; Han, B.; Cui, D.; Liu, J.; Zhang, Q.; Xia, X.; Song, P.; Tang, C.; et al. DeepCCR: Large-scale genomics-based deep learning method for improving rice breeding. Plant Biotechnol. J. 2024, 19, 1–3.
14. Islam, T.; Kim, C.H.; Iwata, H.; Shimono, H.; Kimura, A. DeepCGP: A Deep Learning Method to Compress Genome-Wide Polymorphisms for Predicting Phenotype of Rice. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 2078–2088.
15. Taylor, J.; Fruzangohar, M.; Walter, J. Roseworthy 2014 Field trial phenotype data and matching 17K+ genotype data. The University of Adelaide. Dataset 2023.
16. Yan, Q.; Fruzangohar, M.; Taylor, J.; Gong, D.; Walter, J.; Norman, A.; Shi, J.Q.; Coram, T. Improved genomic prediction using machine learning with Variational Bayesian sparsity. Plant Methods 2023, 19, 96.
17. Song, M.; Greenbaum, J.; Luttrell, I.V.J.; Zhou, W.; Wu, C.; Luo, Z.; Qiu, C.; Zhao, L.J.; Su, K.J.; Tian, Q. An autoencoder-based deep learning method for genotype imputation. Front. Artif. Intell. 2022, 5, 1028978.
18. Suryawati, E.; Pardede, H.F.; Zilvan, V.; Ramdan, A.; Krisnandi, D.; Heryana, A.; Yuwana, R.S.; Kusumo, R.B.S.; Arisal, A.; Supianto, A.A. Unsupervised feature learning-based encoder and adversarial networks. J. Big Data 2021, 8, 1–17.
19. Le, N.Q.K. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023, 23, 2300011.
20. Balabin, H.; Hoyt, C.T.; Birkenbihl, C.; Gyori, B.M.; Bachman, J.; Kodamullil, A.T.; Plöger, P.G.; Hofmann-Apitius, M.; Domingo-Fernández, D. STonKGs: A sophisticated transformer trained on biomedical text and knowledge graphs. Bioinformatics 2022, 38, 1648–1656.
21. Addy, J.W.G.; MacLaren, C.; Lang, R. A Bayesian approach to analyzing long-term agricultural experiments. Eur. J. Agron. 2024, 159, 127227.
22. Danilevicz, M.F.; Gill, M.; Anderson, R.; Batley, J.; Bennamoun, M.; Bayer, P.E.; Edwards, D. Plant genotype to phenotype prediction using machine learning. Front. Genet. 2022, 13, 822173.
23. Tong, H.; Nikoloski, Z. Machine learning approaches for crop improvement: Leveraging phenotypic and genotypic big data. J. Plant Physiol. 2021, 257, 153354.
24. van Dijk, A.D.J.; Kootstra, G.; Kruijer, W.; de Ridder, D. Machine learning in plant science and plant breeding. iScience 2021, 24, 101890.
25. Montesinos-López, O.A.; Montesinos-López, A.; Hernandez-Suarez, C.M.; Barrón-López, J.A.; Crossa, J. Deep-learning power and perspectives for genomic selection. Plant Genome 2021, 14, e20122.
26. Bhimavarapu, U.; Battineni, G.; Chintalapudi, N. Improved optimization algorithm in LSTM to predict crop yield. Computers 2023, 12, 10.
27. Huang, F.; Zhang, J.; Zhou, C.; Wang, Y.; Huang, J.; Zhu, L. A deep learning algorithm using a fully connected sparse autoencoder neural network for landslide susceptibility prediction. Landslides 2020, 17, 217–229.
28. Ren, Y.; Wu, C.; Zhou, H.; Hu, X.; Miao, Z. Dual-extraction modeling: A multi-modal deep-learning architecture for phenotypic prediction and functional gene mining of complex traits. Plant Commun. 2024, 5, 101002.
29. Sandhu, K.S.; Lozada, D.N.; Zhang, Z.; Pumphrey, M.O.; Carter, A.H. Deep learning for predicting complex traits in spring wheat breeding program. Front. Plant Sci. 2021, 11, 613325.
30. Li, J.; Zhang, D.; Yang, F.; Zhang, Q.; Pan, S.; Zhao, X.; Zhang, Q.; Han, Y.; Yang, J.; Wang, K.; et al. TrG2P: A transfer learning-based tool integrating multi-trait data for accurate prediction of crop yield. Plant Commun. 2024, 5, 100975.
31. Lee, H.J.; Lee, J.H.; Gondro, C.; Koh, Y.J.; Lee, S.H. deepGBLUP: Joint deep learning networks and GBLUP framework for accurate genomic prediction of complex traits in Korean native cattle. Genet. Sel. Evol. 2023, 55, 56.
32. Liu, Q.; Zuo, S.-M.; Peng, S.; Zhang, H.; Peng, Y.; Li, W.; Xiong, Y.; Lin, R.; Feng, Z.; Li, H.; et al. Development of Machine Learning Methods for Accurate Prediction of Plant Disease Resistance. Engineering 2024, 40, 100–110.
33. Morales, A.; Villalobos, F.J. Using machine learning for crop yield prediction in the past or the future. Front. Plant Sci. 2023, 14, 1128388.
Figure 1. The proposed DeepAT framework. (a) Dataset sources, (b) genotype data processing, (c) allele encoding, (d) experimental procedure, (e) data feature extraction layer, (f) feature relationship capture layer, (g) DeepAT model architecture.
Figure 2. Training loss variation comparison of DeepAT with the other genotype prediction methods.
Figure 3. Prediction accuracy comparison of DeepAT with the other genotype prediction methods with different evaluation metrics.
Figure 4. Correlation between yield predicted and observed values comparison of DeepAT with the other genotype prediction methods.