Abstract
Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Over the past decades, mainly two different approaches have been considered to compute predictions of RNA secondary structures from a single sequence: the first one relies on physics-based and the other on probabilistic RNA models. Particularly, the free energy minimization (MFE) approach is usually considered the most popular and successful method. Moreover, based on the paradigm-shifting work by McCaskill which proposes the computation of partition functions (PFs) and base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods and clustering techniques have been invented over the last years. However, the accuracy of the corresponding algorithms is limited by the quality of underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding generalizations, like conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to learn about the structural behavior of the molecules by learning (a manageable number of) probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG actually serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence that represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some new ways to derive meaningful structure predictions from generated sample sets are presented. They are used to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. Particularly, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield higher accuracy but eventually require more comprehensive and pure training sets. Investigations on both the accuracies of predicted foldings and the overall quality of generated sample sets (especially on an abstraction level, called abstract shapes of generated structures, that is relevant for biologists) yield the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies a greater structural diversity within generated samples. In general, neither of the two distinct ensemble distributions is more adequate than the other and the corresponding results obtained by statistical sampling can be expected to bare fundamental differences, such that the method to be preferred for a particular input sequence strongly depends on the considered RNA type.
Similar content being viewed by others
Notes
In this article, we call an approach probabilistic if it makes no use of free energy-based models; even if a PF-based Boltzmann sample is a random event, we accordingly do not assume it probabilistic.
A structurally ambiguous SCFG mirror of modern energy-based algorithms for single sequence structure prediction has already been described in Rivas and Eddy (2000).
Note that in order to avoid ambiguity, we will denote a particular sequence by r and the corresponding secondary structure by s in the sequel.
To ensure that a SCFG gets consistent one can, e.g., assign relative frequencies to the productions, which are computed by counting the production rules used in the leftmost derivations of a finite sample (RNA database) of words from the generated language (Chaudhuri et al. 1983).
Here, we decided to consider any possible pair as valid base pair, where non-canonical ones are mostly prohibited due to small probabilities. Thus, in contrast to the thermodynamics-based PF approach which can only handle canonical base pairs, our algorithm is able to deal with arbitrary base pairs, in a convenient way: when using appropriate probabilities, canonical base pairs will be very likely and non-canonical ones will be very unprobable (but not necessarily impossible) to be formed. However, since non-canonical base pairs are usually not permitted in secondary structure models (to limit the number of possible foldings), it would also be adequate to allow only canonical ones. The probabilities for non-canonical base pairs would then be equal to zero.
Note that this separation into transition and emission probabilities corresponds to the standard treatment applied in hidden Markov models.
All references starting with Sm are references to the Supplementary material available at http://wwwagak.cs.uni-kl.de/research/publications/.
Both methods can be implemented to run in \({{\mathcal O}(n^{3})}\) time and with \({{\mathcal O}(n^{2})}\) space requirements for a sequence of length n, where a single secondary structure can be drawn in \({{\mathcal O}(n^{2})}\) time.
Note that the positive predictive value is often called specificity, like, for example, in Do et al. (2006), which will be extensively referenced in the sequel.
It should be clear that this formula would only be correct for \(m_s = 1.\) For choices of \(m_s > 1,\) however, it would only yield approximate results.
Note that the most probable structure is assumed to be (nearly) the MFE structure when sampling is realized via PFs.
TN is the number of base pairs which were correctly predicted as non-matching (true negatives).
In order to perform a k-fold cross-validation, k ≥ 2, on the basis of a given probabilistic model and a set of real-world data, we first have to partition the data randomly into k approximately equal-sized subsets (“folds”). Then, for any \(i \in \{ 1, \ldots, k \},\) we must estimate the model parameters from all objects that are not contained in fold i (training set) and validate the results obtained for all objects that actually belong to fold i (benchmark set). The corresponding result of the cross-validation process is then the average of the results derived for the different folds i, 1 ≤ i ≤ k.
This sample size has proven to be adequate for most applications, as even for a huge set of possible secondary structures of a given sequence, a sample of only 1,000 structures can yield statistical reproducibility of typical sampling statistics, even if samples can be entirely different (see Ding and Lawrence 2003).
Note that in the few cases where the PF approach yields more different shapes, we generally further restricted the possible structures by prohibiting isolated base pairs \((\hbox{min}_{\rm hel}> 1),\) which are in fact allowed in PF calculations.
References
Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424
Chaudhuri R, Pham S, Garcia ON (1983) Solution to an open problem on probabilistic grammars. IEEE Trans Comput C 32(8):748–750
Ding Y (2006) Statistical and bayesian approaches to RNA secondary structure prediction. RNA 12:323–331
Ding Y, Lawrence CE (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31(24):7280–7301
Ding Y, Chan CY, Lawrence CE (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32:W135–W141
Ding Y, Yu Chan C, Lawrence CE (2005) RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA 11:1157–1166
Dirks RM, Pierce NA (2003) A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem 24:1664–1677
Dirks RM, Pierce NA (2004) An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. J Comput Chem 25:1295–1304
Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14):e90–e98
Dowell RD, Eddy SR (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinform 5:71
Eddy SR, Durbin R (1994) RNA sequence analysis using covariance models. Nucleic Acids Res 2(11):2079–2088
Fu KS, Huang T (1972) Stochastic grammars and languages. Int J Comput Inform Sci 1(2):135–170
Giegerich R, zu Siederdissen R (2011) Semantics and ambiguity of stochastic RNA family models. IEEE/ACM Trans Comput Biol Bioinform 8:499–516
Giegerich R, Voß B, Rehmsmeier M (2004) Abstract shapes of RNA. Nucleic Acids Res 32(16):4843–4851
Goodman JT (1998) Parsing inside-out. PhD thesis, Harvard University, Cambridge, MA
Goodman J (1999) Semiring parsing. Comput Linguist 25(4):573–605
Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: an RNA family database. Nucleic Acids Res 31(1):439–441
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33:D121–D124
Hamada M, Kiryu H, Sato K, Mituyama T, Asai K (2009) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25(4):465–473
Hofacker IL (2003) The Vienna RNA secondary structure server. Nucleic Acids Res 31(13):3429–3431
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Schuster P (1994) Fast folding and comparison of RNA secondary structures (the Vienna RNA package). Monatsh Chem 125:167–188
Huang T, Fu KS (1971) On stochastic context-free languages. Inform Sci 3:201–224
Janssen S, Reeder J, Giegerich R (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinform 9:131
Knudsen B, Hein J (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6):446–454
Knudsen B, Hein J (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 31(13):3423–3428
Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288:911–940
McCaskill JS (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29:1105–1119
Nebel ME, Scheid A (2009) On quantitative effects of RNA shape abstraction. Theory Biosci 128(4):211
Nebel ME, Scheid A, Weinberg F (2011) Random generation of RNA secondary structures according to native distributions. Algorithms Mol Biol 6(1):24
Nussinov R, Jacobson AB (1980) Fast algorithms for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA 77(11):6309–6313
Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ (1978) Algorithms for loop matchings. SIAM J Appl Math 35:68–82
Reeder J, Giegerich R (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinform 5:104
Rivas E, Eddy SR (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 285:2053–2068
Rivas E, Eddy SR (2000) Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 6:583–605
Rivas E, Lang R, Eddy SR (2011) A range of complex probabilistic models for RNA secondary structure prediction that include the nearest neighbor model and more
Rozenski J, Crain PF, McCloskey JA (1999) The RNA modification database. Nucleic Acids Res 27:196–197
Ruan J, Stormo GD, Zhang W (2004) An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics 20(1):58–66
Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S (1998) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26:148–153
Steffen P, Voß B, Rehmsmeier M, Reeder J, Giegerich R (2006) RNAshapes 2.1.1 manual
Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J (2002) 5s ribosomal RNA database. Nucleic Acids Res 30:176–178
Viennot G, Vauchaussade De Chaumont M (1985) Enumeration of RNA secondary structures by complexity. Math Med Biol. Lect Notes Biomath 57:360–365
Waterman MS (1978) Secondary structure of single-stranded nucleic acids. Adv Math Suppl Stud 1:167–212
Weinberg F, Nebel ME (2011) Applying length-dependent stochastic context-free grammars to RNA secondary structure prediction. Algorithms 4(4):223–238
Wuchty S, Fontana W, Hofacker I, Schuster P (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49:145–165
Xia T, SantaLucia J Jr, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37:14719–14735
Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244:48–52
Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:133–148
Zuker M, Mathews DH, Turner DH (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In: Barciszewski J, Clark BFC (eds) RNA biochemistry and biotechnology. NATO ASI series. Kluwer Academic Publishers, Dordrecht, pp 11–43
Acknowledgments
The authors wish to thank two anonymous reviewers for their careful and helpful remarks and suggestions made for a previous version of this article.
Author information
Authors and Affiliations
Corresponding author
Additional information
The research of Anika Scheid is supported by Carl-Zeiss foundation.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Nebel, M.E., Scheid, A. Evaluation of a sophisticated SCFG design for RNA secondary structure prediction. Theory Biosci. 130, 313–336 (2011). https://doi.org/10.1007/s12064-011-0139-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s12064-011-0139-7