4.1. Prediction Performance of DRANetSplicer
To evaluate the performance of DRANetSplicer, we computed the six performance metrics presented in
Table 3. In addition, we report the area under the ROC curve (AUC). The predictive results of DRANetSplicer for the three organisms are shown in
Table 4; the results of each individual experiment are given in
Supplementary Materials Table S1.
On the donor splice site dataset, DRANetSplicer achieved an average accuracy of 96.57%, an average error rate of 3.43%, and an average AUC of 99.05%. The average values of precision, sensitivity, specificity, and F1 score metrics all exceeded 96%. On the acceptor splice site dataset, DRANetSplicer obtained an average accuracy of 95.82%, an average error rate of 4.18%, and an average AUC of 98.86%. The average values of precision, sensitivity, specificity, and F1 score metrics all exceeded 95%. These performance metrics indicate that DRANetSplicer exhibits excellent predictive performance across datasets from different species.
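As an illustration of how the threshold-based metrics reported above relate to one another, the following minimal sketch computes them from confusion-matrix counts. Table 3 is not reproduced in this section, so the standard definitions are assumed here; the function name `classification_metrics` and the counts are hypothetical and are not results from this study. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    was made, so that no denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for a balanced test set of 2000 sequences.
metrics = classification_metrics(tp=960, fp=30, tn=970, fn=40)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Note that accuracy and error rate always sum to 100%, which is why the two are reported as complementary values above.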
4.3. Comparison with Benchmark Methods
To provide a more comprehensive evaluation of DRANetSplicer’s performance, we compared it with state-of-the-art prediction models: four deep learning methods, SpliceFinder [
17], Splice2Deep [
30], Deep Splicer [
10], and EnsembleSplice [
11], together with the pretrained model DNABERT [
33].
As shown in
Table 6, the DRANetSplicer model outperforms the benchmark methods in terms of accuracy and F1-score on both the donor and acceptor datasets for all three biological species. A comparison of the performance of each individual experiment with the benchmark method is shown in
Supplementary Materials Table S3. On the donor splice site dataset, DRANetSplicer achieves the best average accuracy of 96.57%, which corresponds to relative reductions in average error rate of 28.4%, 58.8%, 48.1%, 19.1%, and 4.2% compared to SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, respectively. On the acceptor splice site dataset, DRANetSplicer attains the highest average accuracy of 95.82%, corresponding to relative reductions in average error rate of 39.2%, 58.1%, 64.4%, 23.7%, and 11.6% compared to the same benchmark models. DRANetSplicer thus reduces the average error rate relative to every benchmark model, by at least 4.2% on donor splice sites and at least 11.6% on acceptor splice sites.
In
Table 6, we also include the F1-score, the harmonic mean of precision and sensitivity, which reflects the overall performance in correctly predicting splice sites. DRANetSplicer achieves an average F1-score of 96.54% for donor splice sites and 95.80% for acceptor splice sites, whereas the second-ranking model, DNABERT, achieves average F1-scores of 96.42% and 95.27%, respectively. This means that the average F1-score error (1 − F1-score) is reduced in relative terms by 3.4% and 11.2% for donor and acceptor splice sites, respectively. Additionally, we have plotted the ROC curves for DRANetSplicer and other benchmark models, as shown in
Figure 4. From the curves, it can be observed that the performance of the DRANetSplicer model is excellent.
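The relative reductions quoted in this subsection can be checked with a short calculation. The sketch below assumes that the relative reduction is the difference between the baseline error and the model error, divided by the baseline error, where the error is 100% minus the score; the function name `relative_error_reduction` is illustrative. It uses the average F1-scores of DRANetSplicer and DNABERT stated above.

```python
def relative_error_reduction(model_score: float, baseline_score: float) -> float:
    """Relative reduction (in percent) of the error 100 - score.

    Both scores are given in percent, e.g. an accuracy or F1-score of 96.54.
    """
    model_error = 100.0 - model_score
    baseline_error = 100.0 - baseline_score
    return 100.0 * (baseline_error - model_error) / baseline_error


# Average F1-scores of DRANetSplicer versus DNABERT, as reported in the text.
donor_reduction = relative_error_reduction(96.54, 96.42)
acceptor_reduction = relative_error_reduction(95.80, 95.27)

print(f"donor: {donor_reduction:.1f}%")        # 3.4%
print(f"acceptor: {acceptor_reduction:.1f}%")  # 11.2%
```

The same function applies to accuracy, since the error rate is 100% minus the accuracy.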
Importantly, compared with DNABERT, a gene prediction model pretrained on DNA sequences with splice site prediction as one of its downstream tasks, DRANetSplicer substantially reduces the parameter count and computational cost while maintaining predictive performance. Specifically, DNABERT has 99.95 M parameters (Params) and requires 2.56 G floating-point operations (FLOPs), whereas DRANetSplicer has 2.66 M parameters and requires 0.34 G FLOPs. This makes DRANetSplicer more efficient than DNABERT and more suitable for deployment in production environments.
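To put the two model sizes in proportion, the following small calculation derives the ratios from the figures stated above. The dictionary keys are illustrative names; only the four numbers come from the text.

```python
# Parameter counts in millions and FLOPs in billions, as stated in the text.
dnabert = {"params_m": 99.95, "flops_g": 2.56}
dranetsplicer = {"params_m": 2.66, "flops_g": 0.34}

param_ratio = dnabert["params_m"] / dranetsplicer["params_m"]
flops_ratio = dnabert["flops_g"] / dranetsplicer["flops_g"]

print(f"DNABERT has {param_ratio:.1f}x the parameters of DRANetSplicer")
print(f"DNABERT requires {flops_ratio:.1f}x the FLOPs of DRANetSplicer")
```

With these figures, DNABERT has roughly 38 times as many parameters and requires roughly 7.5 times as many FLOPs.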
4.4. Cross-Organism Validation
The results of cross-organism validation are shown in
Table 7. Experimental results for each cross-organism validation are provided in
Supplementary Materials Table S4. When tested on
Oryza, the (donor, acceptor) accuracies of DRANetSplicer are (97.21%, 96.30%) when trained on
Oryza and (93.58%, 93.40%) when trained on
Arabidopsis. When tested on
Arabidopsis, the (donor, acceptor) accuracies are (95.63%, 94.91%) when trained on
Arabidopsis and (94.89%, 94.25%) when trained on
Oryza. When tested on
Homo, the (donor, acceptor) accuracies are (82.95%, 80.17%) when trained on
Oryza and (78.00%, 75.99%) when trained on
Arabidopsis. We observe that cross-organism validation between
Oryza and
Arabidopsis yields better results, while cross-organism validation between
Homo and the other two species is less effective.
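The accuracies above can be organized by (training organism, test organism) to quantify how much accuracy is lost when the training species differs from the test species. The sketch below is only a bookkeeping aid: it assumes that each reported pair is ordered as (donor, acceptor) and belongs to the training organisms in the order they are named, and the function name `transfer_drop` is hypothetical. No same-species baseline for Homo is stated in this subsection, so the drop is computed only for the two plant test sets.

```python
# (training organism, test organism) -> (donor accuracy %, acceptor accuracy %)
accuracy = {
    ("Oryza", "Oryza"): (97.21, 96.30),
    ("Arabidopsis", "Oryza"): (93.58, 93.40),
    ("Arabidopsis", "Arabidopsis"): (95.63, 94.91),
    ("Oryza", "Arabidopsis"): (94.89, 94.25),
    ("Oryza", "Homo"): (82.95, 80.17),
    ("Arabidopsis", "Homo"): (78.00, 75.99),
}


def transfer_drop(train: str, test: str) -> tuple:
    """Accuracy lost (percentage points) relative to training on the test species."""
    same_species = accuracy[(test, test)]
    cross_species = accuracy[(train, test)]
    return tuple(round(s - c, 2) for s, c in zip(same_species, cross_species))


for train, test in [("Arabidopsis", "Oryza"), ("Oryza", "Arabidopsis")]:
    donor_drop, acceptor_drop = transfer_drop(train, test)
    print(f"{train} -> {test}: donor -{donor_drop}, acceptor -{acceptor_drop}")
```

Under this reading, transferring between the two plant species costs less than four percentage points of accuracy in either direction.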
Combining the experimental results from
Table 6, we notice a phenomenon where DRANetSplicer, when trained on one species and tested on another (cross-organism validation), achieves higher accuracy compared to the non-cross-organism validation of benchmark models. For instance, for donor sites, DRANetSplicer trained on
Oryza and tested on
Arabidopsis achieves an accuracy of 94.89%, whereas SpliceFinder, Splice2Deep, Deep Splicer, and EnsembleSplice trained and tested on
Arabidopsis achieve accuracies of 94.43%, 88.91%, 91.42%, and 94.43%, respectively. Bold data in
Table 7 indicate that the cross-organism predictions of DRANetSplicer have higher accuracy than the non-cross-organism predictions of benchmark models. The analysis suggests that DRANetSplicer exhibits strong generalization capabilities, particularly in cross-organism validation between similar organisms. The cross-organism validation between
Oryza and
Arabidopsis yields better results than cross-organism validation between
Homo and the other two species.
Based on these findings, we conclude that DRANetSplicer is applicable to a broader range of organisms and transfers well from a dataset of one species to related, less-studied species. Therefore, the DRANetSplicer model is highly valuable for genome annotation work on less-studied organisms.
4.5. Interpretability
We computed the average absolute weighted contribution scores for each nucleotide position learned by the
Oryza donor and acceptor models. The results were compared with the sequence conservation calculated by WebLogo for the positive dataset, as shown in
Figure 5a. We observed a high similarity between the average absolute weighted contribution scores for each nucleotide position learned by the model and the sequence conservation at that position. The cosine similarities between the two in the
Oryza donor and acceptor datasets were 0.93 and 0.90, respectively. This suggests that the model can effectively learn known gene motif patterns during the training process. Furthermore, the conclusion drawn from
Figure 5a is that nucleotides around the splice sites have the most significant average impact on the model’s prediction results, and regions around the splice sites are generally more crucial than the marginal regions. The heatmap of importance for each nucleotide position learned by the donor and acceptor models, as shown in
Figure 5e, confirms this observation, which is consistent with the findings of previous studies by Zuallaert et al. [
9] and Thompson et al. [
16].
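The comparison between the learned contribution profile and the sequence conservation can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis pipeline used in this study: the per-sequence contribution scores and the conservation values are toy numbers, and the helper names `mean_abs_per_position` and `cosine_similarity` are hypothetical.

```python
import math


def mean_abs_per_position(score_rows):
    """Average the absolute contribution score at each position over all sequences."""
    n_sequences = len(score_rows)
    length = len(score_rows[0])
    return [sum(abs(row[pos]) for row in score_rows) / n_sequences
            for pos in range(length)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: two sequences of length 5 with per-position contribution scores,
# and a toy conservation profile (e.g. information content per position).
score_rows = [
    [0.1, -0.2, 0.9, 1.0, -0.3],
    [-0.1, 0.2, 1.1, 0.8, 0.1],
]
conservation = [0.2, 0.3, 2.0, 1.9, 0.4]

profile = mean_abs_per_position(score_rows)
similarity = cosine_similarity(profile, conservation)
print(f"cosine similarity: {similarity:.3f}")
```

Cosine similarity depends only on the shape of the two profiles and not on their scale, which makes it suitable for comparing contribution scores with conservation values expressed in different units.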
In
Figure 5b, we conducted a detailed investigation of the highlighted regions in the
Oryza donor and acceptor models (marked in
Figure 5a). We compared the individual average weighted contribution scores for the four nucleotides at each nucleotide position with the known gene motifs visualized by WebLogo. Within the studied regions, we observed that the most conserved nucleotide at each position, according to WebLogo, consistently corresponded to the nucleotide with the highest average weighted contribution score at that position in our model. Additionally, the proportions of average weighted contribution scores for different nucleotides at each position were similar to the proportions of nucleotide frequencies calculated by WebLogo.
According to the study by Amit et al. [
46], the G + C content of exons in the
Arabidopsis genome is significantly higher than that of introns. To validate whether our model learned this biological feature during training, we plotted the average weighted contribution scores for G + C and T + A at each nucleotide position in the
Arabidopsis donor and acceptor models, as shown in
Figure 5c. The plot shows that the upstream region (exons) of the donor splice site has positive average contribution scores for G + C and negative scores for T + A, whereas the downstream region (introns) exhibits the opposite trend. Similarly, the downstream region (exons) of the acceptor splice site shows positive average contribution scores for G + C and negative scores for T + A, while the upstream region (introns) displays the opposite pattern. This analysis indicates that our model can automatically learn these biological features to distinguish between exons and introns.
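One way such a per-position profile could be computed is sketched below. The attribution method is not described in this section, so the sketch makes an explicit assumption: each sequence carries one signed contribution score per position, attached to the nucleotide observed there, and scores are averaged over the sequences whose nucleotide at that position belongs to the group of interest. The function name `grouped_mean_scores` and all data are hypothetical toy values.

```python
def grouped_mean_scores(sequences, scores, group):
    """Per-position mean score over sequences whose base at that position is in `group`.

    Returns None at positions where no sequence carries a base from the group.
    """
    length = len(sequences[0])
    means = []
    for pos in range(length):
        values = [row[pos] for seq, row in zip(sequences, scores)
                  if seq[pos] in group]
        means.append(sum(values) / len(values) if values else None)
    return means


# Toy donor windows: positions 0-1 are exonic, 2-3 hold the GT dinucleotide,
# and positions 4-5 are intronic.
sequences = ["GCGTAT", "CAGTTA", "ACGTGA"]
scores = [
    [0.4, 0.2, 0.9, 0.8, 0.2, 0.1],
    [0.2, -0.3, 1.0, 0.7, 0.3, 0.2],
    [-0.1, 0.4, 0.8, 0.9, -0.3, 0.1],
]

gc_profile = grouped_mean_scores(sequences, scores, "GC")
ta_profile = grouped_mean_scores(sequences, scores, "TA")
print("G + C:", gc_profile)
print("T + A:", ta_profile)
```

In this toy example, the G + C profile is positive in the exonic positions and negative in the intronic ones, while the T + A profile shows the opposite signs, mirroring the pattern described above.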
In eukaryotic organisms, the splicing process is conserved, involving not only direct splicing at the donor and acceptor splice sites but also the participation of feature sequences such as branch points (BP) and polypyrimidine tracts (PPT). The PPT, which is rich in pyrimidine bases, is located between the BP and the acceptor splice site, with the BP positioned several dozen nucleotides upstream of the PPT [
47]. From
Figure 5a, we observe that the
Oryza acceptor model distinctly learns the PPT sequence features preceding the acceptor splice site, with the average weighted contribution scores for T + C in this region being positive. Iwata et al.’s study revealed a strong negative correlation between PPT and BP signals in eukaryotes, with a strong positive correlation with the acceptor splice site signal [
47]. In
Figure 5d, we investigate the impact of CTNA and PPT on splicing in the
Homo acceptor model, where CTNA is a reported typical BP motif [
47].
We observe a negative correlation between the average weighted contribution scores of CTNA and PPT in the region several dozen bases upstream of the acceptor splice site. Additionally, there is a small peak in the average weighted contribution scores of CTNA in a small region just before the rise in the average weighted contribution scores of PPT (highlighted in
Figure 5d). Furthermore, we notice that as we approach the acceptor splice site, the average contribution scores of CTNA become smaller and negative, while those of PPT exhibit the opposite trend. These observations confirm that our model can automatically learn the biological features of BP and PPT.
Research by Gooding et al. suggests a unique biological feature upstream of the acceptor splice site, known as the AG exclusion zone [
48]. To verify the model’s learning of this biological feature, we studied the average contribution scores of the AG motif in the
Homo acceptor model, as shown in
Figure 5d. We observe a sharp decrease and negative values in the average weighted contribution scores of the AG motif in a region just before the acceptor splice site.
Figure 5e displays importance heatmaps for each nucleotide position in the donor and acceptor models. The heatmaps show that nucleotides around splice sites have the greatest average impact on the model’s predictive results. Moreover, the broader regions around splice sites are generally more important than the marginal regions. The heatmaps of the models for the three organisms corroborate these observations and are consistent with the findings in
Figure 5a. Notably, we observe a high similarity between the heatmaps of
Oryza and
Arabidopsis models, indicating a high similarity in the nucleotide regions crucial for decision-making in the model learning process for these two organisms. However, there are substantial differences in the decisive nucleotide regions between the
Homo model and the models of
Oryza and
Arabidopsis.
The nucleotide position heatmaps are consistent with the results of the cross-species validation experiments, in which validation between Oryza and Arabidopsis performs better than validation involving Homo. This suggests that similar organisms share a high degree of similarity in the decision features learned during model training. From a model interpretation perspective, this further supports the capability of DRANetSplicer for cross-species prediction of splice sites.
In summary, we have demonstrated that the DRANetSplicer model can learn known biological features, indicating its ability to automatically learn biological features from gene sequences. However, we need to further enhance model interpretability to achieve the goal of providing biological insights that human experts have not described yet, rather than being just a black-box classifier. This represents a major challenge and is a direction for our future work.