Date of Award


Degree Name

Master of Science


Computational Science


Ming-Ying Leung


Ribonucleic acid (RNA) molecules and their secondary structures play important roles in many biological processes, including gene expression and regulation. The genomes of many viruses are also RNA molecules. Because secondary structures are crucial for RNA functionality, computational prediction of RNA secondary structures has been widely studied. However, the memory and computing time that complex secondary structures demand restrict existing thermodynamically based prediction algorithms to short RNA sequences of a few hundred bases. One approach to overcoming this limitation is to first cut a long RNA sequence into shorter, non-overlapping, manageable chunks whose secondary structures are predicted individually, and then to assemble the chunk predictions into the structure of the original sequence.

The cutting process is a crucial component of this approach. All secondary structure elements, including stem-loops and pseudoknots, contain an inversion, which is a stretch of nucleotides followed closely by its inverted complementary sequence. On this basis, our group has previously proposed cutting methods based on inversion distributions. In this thesis, I compare three sequence cutting methods, called the centered, optimized, and regular methods, in terms of how well each retains the prediction accuracy of the PKnotsRG algorithm.
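The inversion definition above can be illustrated with a short Python sketch. It is a minimal illustration, not the thesis implementation: the function name `find_inversions`, the restriction to Watson-Crick pairs (A-U, G-C), and the choice to report only stems of exactly the minimum length are all assumptions made here for clarity.

```python
# Hypothetical helper: Watson-Crick pairs only (an assumption; wobble G-U pairs are ignored).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(stem):
    """Return the inverted complementary sequence of an RNA stretch."""
    return "".join(COMPLEMENT[base] for base in reversed(stem))


def find_inversions(seq, min_stem, max_gap):
    """Return (start, gap) pairs where seq[start:start+min_stem] is followed,
    after at most max_gap unpaired bases, by its inverted complement."""
    hits = []
    n = len(seq)
    for start in range(n - 2 * min_stem + 1):
        left = seq[start:start + min_stem]
        if any(base not in COMPLEMENT for base in left):
            continue  # skip stretches with ambiguous or non-RNA symbols
        target = reverse_complement(left)
        for gap in range(max_gap + 1):
            right_start = start + min_stem + gap
            if right_start + min_stem > n:
                break  # the right arm would run past the end of the sequence
            if seq[right_start:right_start + min_stem] == target:
                hits.append((start, gap))
    return hits
```

For example, `find_inversions("GGGAAAACCC", 3, 4)` reports the single inversion starting at position 0 with a gap of 4 bases, while lowering `max_gap` to 3 reports none.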

From the RFAM database, two sets of RNA sequences with known secondary structures have been selected as test data for the cutting methods. The first set contains 50 sequences without pseudoknots, and the second set contains 12 sequences with pseudoknots. The ratio between the prediction accuracy obtained with and without chunking is calculated over a range of inversion parameters, namely the minimum stem length l and the maximum gap size G. With l ranging from 3 to 8, and G from 0 to 8, the maximum accuracy retention (MAR) percentage is obtained for each test sequence and each cutting method. We also experiment with varying the maximum chunk length c between 60 and 300 and observe its influence on the MAR.
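As a rough illustration of how the MAR could be computed for one sequence and one cutting method, the Python sketch below takes the accuracy of the whole-sequence prediction and a table of chunked-prediction accuracies keyed by (l, G). The function name and data layout are assumptions, and the accuracy measure itself is treated as given because the abstract does not define it. The result is returned as a ratio; multiplying by 100 gives a percentage.

```python
def max_accuracy_retention(acc_whole, acc_chunked):
    """Return (MAR, best_params) for one sequence and one cutting method.

    acc_whole   -- accuracy of the prediction on the uncut sequence
    acc_chunked -- dict mapping (min_stem_length, max_gap_size) to the
                   accuracy obtained after chunking with those parameters
    """
    if acc_whole <= 0:
        raise ValueError("whole-sequence accuracy must be positive")
    if not acc_chunked:
        raise ValueError("no chunked accuracies supplied")
    # Pick the parameter pair whose chunked accuracy is highest.
    best_params = max(acc_chunked, key=acc_chunked.get)
    return acc_chunked[best_params] / acc_whole, best_params


# Illustrative numbers only, not results from the thesis.
example = {(3, 0): 0.25, (4, 2): 0.75, (8, 8): 0.5}
mar, params = max_accuracy_retention(0.5, example)
```

A ratio above 1 means the chunked prediction was more accurate than the prediction on the whole sequence for at least one parameter pair.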

To systematically analyze the impact of the various cutting methods, prediction algorithms, and inversion parameters, we have established a modularized parallel computing framework using Hadoop MapReduce that enables us to automatically and efficiently explore large parametric spaces of chunking, prediction, reconstruction, and analysis methods. To study the framework performance, we use a dataset of longer sequences consisting of seven RNA genomes of viruses from the family Nodaviridae, with lengths around 1300 or 3200 bases. Their secondary structures are not known, and because of their lengths, the use of MapReduce is vital for the exhaustive exploration of their possible secondary structures.
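The map and reduce pattern behind such a parameter sweep can be sketched in plain Python. This is an in-process simulation only and does not use Hadoop; the names `run_map_reduce` and `parameter_grid`, the key layout, and the placeholder score are all hypothetical. In a real run the mapper would chunk a sequence, call the prediction program, and emit a measured accuracy.

```python
from collections import defaultdict
from itertools import product


def parameter_grid(seq_ids, methods, stem_lengths, gap_sizes):
    """Enumerate one task per (sequence, method, l, G) combination."""
    return list(product(seq_ids, methods, stem_lengths, gap_sizes))


def run_map_reduce(inputs, mapper, reducer):
    """Apply mapper to every input, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def toy_mapper(task):
    """Placeholder mapper: emits l + G as a stand-in score, not a real accuracy."""
    seq_id, method, stem_len, gap = task
    yield (seq_id, method), stem_len + gap


def best_score_reducer(key, values):
    """Keep the best score seen for each (sequence, method) key."""
    return max(values)


tasks = parameter_grid(["seq1"], ["centered", "optimized", "regular"],
                       range(3, 9), range(0, 9))
results = run_map_reduce(tasks, toy_mapper, best_score_reducer)
```

With l from 3 to 8 and G from 0 to 8, each sequence and method pair yields 54 independent tasks, which is what makes the sweep straightforward to distribute.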

For the majority of test sequences, our results show that at least one of the cutting methods produces an MAR value greater than one, implying that the prediction accuracy of the PKnotsRG algorithm is actually improved by using the chunks instead of the whole sequence. Furthermore, the inversion-based centered and optimized methods outperform the regular method, which cuts the sequence naively into fixed-length chunks. This suggests that our approach to secondary structure prediction of long RNA sequences by cutting is viable, but the cutting should be performed intelligently by considering sequence features such as inversions.
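For comparison, the naive fixed-length cutting used as the baseline can be sketched in a few lines of Python. The function name `regular_chunks` is hypothetical, and letting the final chunk be shorter than the others is an assumption; the abstract does not say how the regular method handles the remainder.

```python
def regular_chunks(seq, chunk_len):
    """Cut seq into consecutive, non-overlapping chunks of chunk_len bases.

    The last chunk holds whatever remains and may be shorter (an assumption).
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
```

Because the cut points ignore the sequence content, a chunk boundary can fall between the two arms of an inversion, which is the situation the inversion-based methods are designed to avoid.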

The MapReduce performance analyses have also demonstrated that our approach can be implemented to run efficiently in the Hadoop MapReduce framework. This opens up possibilities to continue my research on exploring better models for secondary structure elements, testing the cutting methods with other prediction algorithms, and finding optimal values for inversion and chunk length parameters for prediction of secondary structures in long RNA sequences.




Received from ProQuest

File Size

84 pages

File Format


Rights Holder

Daniel Tesfai Yehdego