Date of Award


Degree Name

Master of Science


Computational Science


Ming-Ying Leung


Rapid advances in next generation sequencing (NGS) technologies provide many opportunities to identify associations between genetic sequence variants (GSV) and diseases, which may lead to better clinical diagnosis and treatments. OncoMiner is a bioinformatics pipeline developed at UTEP for mining NGS data. It can identify exonic sequence variants, link them with associated literature, visualize genomic locations, and compare their occurrence frequencies among different groups. However, the current version of OncoMiner accepts only a specific input file format provided by the Otogenetics NGS Lab Services. The main objectives of my current work are (1) to develop a Python script for preprocessing NGS files in the more widely used variant calling format (VCF) and converting them to the OncoMiner input (OMI) format, and (2) to evaluate the performance of the script.

Most of the required data fields in the OMI file can be extracted directly from the VCF file. The genomic region type, however, needs to be determined by comparison with a reference genome. Since I will be working on human cancer data, the reference genome used for this work is the human genome assembly hg38 obtained from the UCSC Genome Browser. To improve efficiency, the script splits the VCF file and reference genome by chromosome into smaller files for parallel processing. Our script has been tested on 148 VCF files, containing data from prostate cancer patients, downloaded from The Cancer Genome Atlas (TCGA). Parallelization of the script yielded average speedups of 1.50, 2.28, 3.14, 3.84, and 4.00 using 2, 4, 8, 16, and 24 cores, respectively. To test the program's capability of handling big datasets, 35 larger files with sizes ranging from 193.8 MB to 3.7 GB were used. These files contain data from leukemia patients, cell lines, and normal individuals collected at local hospitals and UTEP. Both the number of variants and the number of samples in the VCF file were found to be significantly correlated with runtime. A multiple linear regression indicated that 83% of the variation in the runtime can be explained by its relationship with the numbers of variants and samples.
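The chromosome-wise splitting described above can be illustrated with a minimal Python sketch. It is not the thesis script: the function names (split_by_chromosome, summarize_chunk, process_in_parallel) are hypothetical, the per-chromosome task is only a placeholder that counts variant records, and the data are held in memory rather than written to smaller files. The sketch relies only on the standard VCF layout, in which header lines start with "#" and the first tab-separated column of each data line is the chromosome.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def split_by_chromosome(vcf_lines):
    """Group VCF data lines by the CHROM column (first tab-separated field)."""
    chunks = defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/meta lines, which start with '#'.
        if not line or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        chunks[chrom].append(line)
    return dict(chunks)


def summarize_chunk(item):
    """Hypothetical per-chromosome task: count the variant records."""
    chrom, lines = item
    return chrom, len(lines)


def process_in_parallel(chunks, workers=4):
    """Run summarize_chunk on each chromosome chunk in separate processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(summarize_chunk, chunks.items()))


if __name__ == "__main__":
    # Tiny made-up example, for illustration only.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
        "chr1\t67890\t.\tC\tT\t60\tPASS\t.",
        "chr2\t13579\t.\tG\tA\t45\tPASS\t.",
    ]
    print(process_in_parallel(split_by_chromosome(example), workers=2))
```

Because each chromosome chunk is independent, the work distributes naturally across cores, but uneven chromosome sizes and fixed overheads limit the gain, which is consistent with the reported speedups leveling off near 4.00 at 24 cores.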

We plan to incorporate this preprocessing script into the OncoMiner pipeline and use it for downstream analyses of a collection of 500 prostate cancer VCF files from TCGA and of the local leukemia dataset, to identify GSVs associated with the diseases and prioritize risky variants based on their predicted functional effects.




Received from ProQuest

File Size

88 pages

File Format


Rights Holder

Bofei Wang