1. Introduction
Accurate prediction of RNA base pairings from sequence remains a fundamental problem in bioinformatics. While new methods continue to advance the ribonomics research frontier [
1,
2,
3,
4,
5,
6,
7,
8,
9], experimentalists still obtain useful functional insights from the classical minimum free energy (MFE) secondary structure predictions [
10] under the Nearest Neighbor Thermodynamic Model (NNTM). Given the on-going popularity of such predictions, we focus here on characterizing accuracy improvements from one of the smallest possible changes to this approach: modifying only three of over 8000 NNTM parameters [
11]. As will be explained, these three parameters govern the entropic cost of branching, which is a critical aspect of the overall molecular configuration. Moreover, they are some of the few not based on experimental data, and so are reasonable candidates for such a targeted reevaluation. Here, we consider two families of RNA molecules: transfer RNA (tRNA) and 5S ribosomal RNA (rRNA). Their sequence lengths are amenable to our current methods, while providing two different branching configurations to analyze. This enables us to confirm the extent of accuracy improvement possible on a per family basis, while also illustrating the challenges of obtaining such improvements simultaneously over two (or more) families.
An RNA secondary structure is a set of intra-sequence base pairs. For thermodynamic prediction purposes, the target pairings are typically canonical, i.e., Watson-Crick or wobble, and pseudoknot-free. (Removing these constraints, especially the latter, are active areas of research in the field.) Given a secondary structure, its free energy change from the unfolded sequence can be approximated under the NNTM; this model and the evolution of its many parameters (nearly all of which are for the special cases of small internal loops) are cataloged in an online database [
11]. Under the NNTM, when given a sequence, an MFE secondary structure can be computed efficiently using dynamic programming [
12,
13,
14,
15].
Historically, the prediction accuracy is high on average for sequences of length 700 nucleotides or less [
16]. In particular, it was found that an average (with standard deviation) of 83.0% (±22.2) of pairings in 484 transfer RNA (tRNA) secondary structures, totaling 10,018 base pairs (bp) and 37,502 nucleotides (nt), were predicted correctly. Likewise, 77.7% (±23.1) of pairings in 309 5S ribosomal RNA (rRNA) secondary structures, totaling 10,188 bp and 26,925 nt, were predicted correctly. While this clearly supports the value of MFE predictions, it also highlights that even at this scale of sequence lengths, i.e., ∼76 nt for tRNA and ∼120 for 5S rRNA, the prediction accuracy for an individual sequence can be low.
Recent results have demonstrated that it is possible to obtain a statistically significant increase in MFE prediction accuracy on a diverse training set of 50 tRNA sequences and 50 5S rRNA by changing the thermodynamic cost of branching [
17]. Recall that an RNA secondary structure is composed of different substructures, which are scored by different components of the NNTM objective function. The substructures known as multiloops (or branching junctions) have three or more helices which radiate out as branches. A tRNA molecule has one central multiloop with four branches, while 5S rRNA has one with three. Although these branching loops are a critical aspect of the overall molecular conformation, they remain one of the most difficult aspects to predict accurately [
18,
19].
Previously, mathematical techniques from discrete optimization and geometric combinatorics were used to completely characterize all the secondary structures which were optimal for all possible combinations of different branching parameters on the chosen training sets [
17]. It was found that 89% of tRNA and 90% of 5S rRNA predictions could be improved by altering the branching parameters from the default values. (Those which did not improve already had an accuracy well above average.) Critically, though, achieving this improvement
simultaneously, i.e., for the same set of new parameters, is not possible; the intersection of all the “best possible” set of parameters for each sequence is empty.
At the time, an ad hoc combinatorial method was used to identify large combinations of non-empty intersections among the individual “best possible” sets for a given collection of training sequences. It was demonstrated that the branching parameters obtained in this way yielded a statistically significant improvement in MFE prediction accuracy over the existing values on tRNA, on 5S rRNA, and on the total 100 sequence training set, respectively. However, a significant gap remained between the average (known to be unobtainable) of the maximum attainable individual accuracies with modified branching parameters, and the best ad hoc values found.
The new method and associated results presented here eliminate that gap by giving a branch-and-bound algorithm, and an effective implementation, for finding parameters with the optimal MFE prediction accuracy across the given collection of RNA sequences. Somewhat surprisingly, we find that the improvement in prediction accuracy between the new branch-and-bound (BB) parameters and previous ad hoc (AH) ones is
not statistically significant. To test this conclusion, we computed the prediction accuracies for the different branching parameters under consideration on a much larger set of 557 tRNA and 1283 5S rRNA sequences [
20] from the Mathews Lab (U Rochester). We again saw no significant difference between the AH and BB parameters in MFE prediction accuracy for the testing sequences.
Moreover, we also confirmed that the new parameters give a statistically significant improvement over the current ones on the type of testing sequences, i.e., tRNA, 5S rRNA, or both, for which the parameters were trained. In conjunction, these results suggest that there may be a relatively large set of branching parameters which yield equivalent prediction accuracies that improve over the current ones. To move forward in identifying the scope of these parameters, we confirm that the current empirical strategy of focusing on the trade-off between two of the three branching parameters is well-substantiated by our analysis. This is especially useful as such a reduction in the dimension will enable the approach to be applied to longer sequences than those considered here.
3. Results
We first address the extent of improvement possible when the branching parameters are trained on a specific family, either tRNA or 5S rRNA. We then consider the improvement possible across both training families simultaneously. The training results are tested against a much larger set of 1840 sequences, and the conclusions are found to hold. Importantly, this demonstrates that the method is not overfitted to the training data. Finally, the relevance of parameter precision to the results is addressed.
3.1. Improving Per Family Prediction Accuracy
Previously [
17], mathematical methods were used to find the best of all possible combinations of branching parameters
for each individual training sequence. Averaging these per sequence accuracies over their family, or the whole training set, yields the “max” accuracy listed in
Table 2. These numbers are an upper bound on the improvement possible by modifying the branching parameters for MFE prediction accuracy on these training (sub)sets. We note that the upper bound is known not to be achievable, since the common intersection (even within a family) of the best per sequence parameter values is empty.
A lower bound on the improvement was established [
17] by ad hoc means, that is by identifying large sets of sequences from each family which did have such a common intersection. Seven such large sets were found for tRNA and four for 5S rRNA. These combinations of possible parameters were considered for each family, and the one with best average accuracy over the whole family was reported. We refer to these parameter combinations here as AHt and AHs, for the tRNA and 5S rRNA families respectively. It was found that the AHt and AHs accuracy on their respective family, listed in
Table 2, was a statistically significant improvement over the T99 parameters.
However, there was a sizable gap between the upper and lower bounds for the potential accuracy improvements for each family considered. The branch-and-bound algorithm presented here was implemented to determine where in this range the best possible per family accuracy lay. The new parameters are listed in
Table 2 as BBt and BBs along with the corresponding accuracies.
As a technical aside, the best possible region for 5S is unbounded in the direction. The BBs parameters reported are the centroid of the finite 2D face. We also considered a strictly interior point BBs, which gave nearly identical accuracy for tRNA. Interestingly, this new point is very close to AHs, with a distance of (0.2, 0.1, 0).
It is worth noting that the signs of these family-optimal parameters are consistent within the family between the ad hoc and branch-and-bound combinations, but reversed between the two families. This suggests that combinations which are optimal for different configurations, i.e., a tRNA cloverleaf versus a 5S rRNA Y-shape, may occupy different parts of the 3D parameter space. It is also relevant to future analyses that the range of the b parameters is very narrow (from to ) and distributed fairly evenly around 0.
As evaluated by a two-sample t-test, the differences between the ad hoc and branch-and-bound parameter accuracies within the family on which the parameters were optimized are not significant, with for tRNA and for 5S rRNA. In other words, the maximum possible accuracy per family over the training set sequences is essentially the best ad hoc accuracy, and far from the average over the per sequence maximums. This closes the gap from the previous analysis, while opening the door to new questions as discussed in the next section.
3.2. Improving cross Family Prediction Accuracy
It was observed before that the accuracy of the new AHt and AHs parameters on the other family is clearly worse than the Turner parameters. Consequently, a “best both” combination, denoted here AHb, was also identified from among the 11 possible ad hoc ones considered, 7 for tRNA and 4 for 5S rRNA. As evaluated by a two-sample t-test, the improvement over the T99 parameters on the whole training set accuracy is significant (). This is due entirely to tRNA; the AHb parameters raise the tRNA accuracy by an amount statistically indistinguishable (t-test ) from AHt, while the AHb accuracy for the 5S rRNA training sequences was indistinguishable (t-test ) from T99. This is interesting because it demonstrates that, in terms of trade-offs in prediction accuracy between tRNA and 5S rRNA, the former can be improved without negatively affecting the latter. We note that only per family ad hoc parameters were considered at the time, so it seemed plausible that a better parameter combination could be found when both tRNA and 5S rRNA were considered concurrently.
However, this turned out not to be the case when the branch-and-bound algorithm was run on the full 100 sequence training set. The parallelization is implemented in such a way that three combinations of “near optimal” parameters were detected, along with the absolute best one. When the parameters are rounded to 1 decimal precision, 3 of the 4 combinations give identical 0.66 average accuracy when rounded to 2 decimal places, and the fourth is only 0.02 less. Given this, we report as the BBb parameters the combination with 0.66 average accuracy which was most dissimilar to AHb. The other 2 combinations were AHb and , although AHb was originally optimized for tRNA and not the full training set. The fourth was more similar to BBb but the combination was noticeably worse for 5S by 0.05 with only a 0.01 improvement for tRNA.
These results illustrate that, while it is possible to improve over the current T99 prediction accuracy by a statistically significant amount, doing so simultaneously over two different families is more challenging than improving the per family accuracy. Additionally, the best possible accuracy over the full training set is the same as the previous ad hoc one.
3.3. Results for Testing Data
We note that the summary statistics for “Both” families in
Table 3 were computed as a weighted average and standard deviation, so that each family contributed 50% to the statistic although there were more than twice as many 5S rRNA as tRNA sequences in the testing set. With a total of 1840 sequences, the difference in accuracy over the whole data set for the T99 parameters versus the AHb or BBb is significant; the Tukey HSD Post-hoc Test following an ANOVA were both
while the differences between AHb and BBb were not (
). The AHb and BBb parameters performed just as well on the tRNA testing sequences as on the training ones (ANOVA
), although the 5S accuracy was less good (ANOVA
). A Tukey HSD Post-hoc Test found significant differences between the AHb training accuracy and the testing one (
) as well as the BBb testing one (
) for the 5S family, but the other four pairwise comparisons were indistinguishable. We also note that the parameters trained on 5S sequences have lower accuracy on the testing data.
Other than this lower accuracy for 5S rRNA, several patterns observed in
Table 2 also hold in
Table 3. First, parameters which were trained on one family are markedly less accurate on the other one. When appropriately combined to weigh each family equally, this yields overall accuracies comparable to the Turner values. However, parameters chosen to raise the overall accuracy can achieve a statistically significant improvement over T99. Finally, branch-and-bound and ad hoc accuracies are remarkably similar over the families on which they were trained.
3.4. Sensitivity to Parameter Precision
As discussed in
Section 2, the accuracy computations here focused on branching parameters specified to one decimal precision. We note that the “exact” parameters computed, that is ones specified as a rational number to the maximum precision allowable by the size of an integer in the computer algebra system used, always gave higher accuracy on the training (sub)set used. The largest difference seen was less that
, which is not a statistically significant difference over data sets of this size with this amount of variance in the accuracy means.
4. Discussion
Previous results [
17] demonstrated that it was possible to achieve a statistically significant improvement in MFE prediction accuracy by altering the three NNTM parameters which govern the entropic cost of loop branching. This was shown on a set of 50 tRNA sequences, on another of 50 5S rRNA, and on the full training set of both families combined. However, the extent of the possible improvement was unknown, although a lower bound was given by the ad hoc parameters identified— listed as AHt, AHs, and AHb respectively in
Table 2. The “max” upper bound, known not to be attainable, was provided by the average maximimum accuracy (over any combination of parameters) for each individiual training sequence. Hence, the open questions was to establish the maximum
simultaneous improvement over these training sets.
Here we provide a branch-and-bound algorithm which takes as input a set of RNA branching polytopes [
23] and finds the parameters with the best possible accuracy over the entire set. We describe implementation details needed to insure that the basic algorithm runs efficiently enough to be useful in practice, and give results on our original training set as well as a much larger testing set available from the Mathews Lab (U Rochester) with 557 tRNA and 1283 5S rRNA. The differences in MFE accuracy under the standard T99 parameters between the training and testings sequences are not significant, and we find that the general trends observed in the training data are borne out by the testing results.
First, and most surprising, we find that the branch-and-bound parameters do not improve on the ad hoc ones in any significant way. Hence, we now know that the best possible MFE prediction accuracy for the tRNA training sequences is 0.75 on average and 0.73 for 5S rRNA. The testing data achieves a comparable accuracy for tRNA, although the ad hoc AHt parameters are actually slightly better than the branch-and-bound BBt. The AHs and BBs accuracies for the 5S rRNA testing sequences are equivalent, if lower than the training ones (but not by a statistically significant amount).
Overall, the average MFE prediction accuracy for the 5S rRNA testing sequences is consistently lower, by ∼5%, than the training accuracy for all parameter combinations considered. It is not obvious why this should be the case, since the GC content is equivalent and the length and number of native base pairs were actually lower on average for the testing sequences. In the future, it may be worthwhile to investigate what other sequence and/or structural characteristics might correlate with this training versus testing trend for 5S rRNA.
Returning to the accuracy improvement question, the branch-and-bound results for each family establish a much more realistic upper bound than the previous “max” values from
Table 2. However, we also know that achieving this level of accuracy simultaneously across the two families is not possible. The testing data reinforces the point that parameters which are optimized only for one family perform much less well for the other. For the family-specific parameters, i.e., AHt, AHs, BBt, and BBs, this yields combined accuracies over both families which are on par with the Turner parameters for both training and testing data. In the case of the AHt paramaters, the improvement over T99 for the full testing set is statistically significant (
) but not for the training set (
).
In contrast, when the parameters are chosen to maximize the accuracy of both families, a statistically significant improvement over T99—which has the highest combined accuracy over the three Turner parameter sets—is achieved for both testing and training data. As described for the training results, there are multiple different parameter combinations that achieve essentially the same accuracy over both families. This conclusion holds true for the testing data as well. Such stability in the maximum combined accuracy strongly suggest that future studies should focus on “near optimal” parameter combinations. This is particularly appropriate when the parameters are commonly specified to 1 decimal precision.
In general, we find that the AHb and BBb parameters are better for tRNA than for 5S rRNA, relative to the family-specific parameters. It is interesting to note that the Turner parameters with the highest 5S rRNA accuracy are the earliest ones which had , whereas the T99 and T04 parameters both have . The parameters trained only on 5S rRNA which produce the highest accuracy on that family, i.e., AHs and BBs, both had while the opposite was true for tRNA. However, it was possible to achieve the same accuracy (to two decimal places) over the entire 100 sequences training set with either or , and essentially the same for the 1840 sequence testing set.
Over all the new parameter combinations considered, we found a small range of
b values, roughly centered around 0. Furthermore, previous results [
17] had found that the thermodynamic optimization was most sensitive in the
b direction. In future work, we expect to specialize the
b value in our parametric analysis to better focus on the
a and
c trade-offs. We have preliminary results which indicate that this is not too detrimental to the overall accuracy, and this reduction in complexity will certainly improve the time and memory needed to compute the branching polytopes.
Recall that a weighs the number of multiloops while c scores the total number of branching helices. In terms of trade-offs between them, we note that the most recent Turner parameters have a significant increase in a and a for the first time. All six new parameter combinations have opposite sign for a and c suggesting that there may be an important reward/penalty balance to be achieved. Not only did the 5S-specific parameters have but they also both have and , whereas the tRNA-specific had the exact opposite signs. Although the same accuracy could be achieved over the whole training data set with b either positive or negative, there were no “near optimal” parameter combinations found which had and . Hence, it seems that there may be a greater range of combinations which produce an acceptable 5S rRNA accuracy than there is for tRNA. In fact, BBb is quite close to T04, with a difference of . This small change produces a considerable improvement in tRNA accuracy, with a smaller corresponding decrease for 5S rRNA.
In conjunction, these results suggest that there may be a relatively large set of branching parameters which yield equivalent prediction accuracies that improve over the current ones. To make progress on identifying the scope of these parameters, we confirm that the current empirical strategy of focusing on the trade-off between the a and c parameters is well-substantiated by this analysis. Moving forward, the goal is to expand the training set to include other RNA families, but this requires new algorithmic approaches to computing the branching polytopes. To date, the longest sequence attempted (an RNase P of length 354 nt) took more than 2 months. However, focusing on the a,c trade-offs only should reduce the complexity substantially, yielding further insights into the advances possible and limitation faced when improving RNA branching predictions.
5. Conclusions
We conclude by putting the results reported here into a broader context. In demonstrating that a targeted reevaluation of the NNTM branching parameters can improve MFE prediction accuracy, we also highlight the challenges of doing so across more than one family simultaneously. This is relevant not only to thermodynamic methods for RNA secondary structure prediction, but also to machine learning approaches which require large enough training sets to parameterize all the new structural features proposed. It has been demonstrated that the risk of overfitting to the training data is real [
26], but also that using thermodynamic information in a machine learning method can reduce the problem while still improving prediction accuracy [
27].
While the importance of training across different RNA families is well-known, to the best of our knowledge, this is the first time that the resulting parameters have been explicitly compared when trained on one family versus another, and then on both. In this way, we are able to characterize the effect on the maximum possible per family accuracy—which is now known—when optimizing the branching parameters over both of our training families. By focusing only on two families, this interplay can be clearly analyzed. We see an explicit trade-off between tRNA and 5S rRNA, and this suggests that it may be worthwhile to consider the trade-offs being made by other methods. In doing so, it is important that the training and testing data sets be balanced, i.e., where each family contributes equally to the accuracy. Otherwise, improvements in the dominant family will disproportionally affect the combined statistic.
Moving forward, we plan to extend this approach to analyze the trade-offs in branching parameters among additional families. As a first step in that direction, we did consider the prediction accuracy under the ad hoc and branch-and-bound parameters for each of the other eight families from the Mathews data set [
20,
22,
28]. For half of these families (small subunit ribosomal RNA domains, large subunit ribosomal RNA domains, group I self-splicing introns, and group II self-splicing introns), an ANOVA on the average family accuracy found no significant differences among T99, T04, AHb, and BBb. Hence, even though the new parameters were trained on two entirely different families, this did not negatively affect the accuracy. In the four families where there was a significant change (
) in the ANOVA, two improved while two worsened—again highlighting the trade-offs in branching parameters between families. Interestingly, the former (RNase P and tmRNA) were “tRNA-like” in the sense that they performed as well under the AHt parameters (although not the BBt) whereas the latter (signal recognition particle RNA and telomerase RNA) were “5S rRNA-like” in the sense that they performed as well on AHs and BBs as on T99 and T04.
In summary, analyzing the trade-offs in MFE prediction accuracy improvements possibly by modifying the NNTM branching parameters may yield improvements more generally. At the very least, it will characterize why obtaining such improvements simultaneously over different RNA families with distinct branching configurations has remained a challenge.