Gene Selection and Classification in High-Throughput Biological Data With Integrated Machine Learning Algorithms and Bioinformatics Approaches

Abhijeet R Patil, University of Texas at El Paso


With the rise of high-throughput technologies in biomedical research, large volumes of expression profiling, methylation profiling, and RNA-sequencing data are being generated. These high-dimensional data have a large number of features but a small number of samples, a characteristic called the “curse of dimensionality.” The selection of optimal features, which largely determines the performance of classification algorithms in machine learning models, poses challenging problems in bioinformatics analyses of such high-dimensional datasets. In this work, I focus on the design of two-stage frameworks of feature selection and classification and their applications in multiple sets of colorectal cancer data. The first algorithm developed was a combination of resampling-based least absolute shrinkage and selection operator (lasso) feature selection (RLFS) and ensembles of regularized regression models (ERRM) capable of handling data with high correlation structures. The ERRM boosted the prediction accuracy with the top-ranked features obtained from RLFS. The second algorithm was a modified adaptive lasso method with normalized weights from various feature selection methods. Here, the genes were ranked based on their levels of statistical significance. The scores of the ranked genes were normalized and supplied as weights to the adaptive lasso method, which selected the most significant genes; these genes are known to be biologically related to the cancer type and helped attain higher classification performance. Lastly, I introduced a resampling-based group lasso (glasso) feature selection method that accounts for the group correlation among the features and can ignore features unrelated to the response variable. The selected features, when supplied to various classifiers, increased the classification accuracy. I applied the above algorithms to both simulated and real data to show that the proposed methods perform better than existing ones.
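As a rough illustration of the resampling idea behind RLFS, the sketch below refits a lasso on bootstrap resamples and ranks features by how often they receive a nonzero coefficient. It is not the thesis's implementation: the squared-error lasso, the coordinate-descent solver, the penalty value `lam`, the selection-frequency ranking, and the simulated data are all assumptions made for brevity, and a classification task would normally use a logistic loss instead.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=50):
    """Squared-error lasso by cyclic coordinate descent.

    Minimizes (1/2n)*||y - X b||^2 + lam*||b||_1 using a fixed number of
    sweeps (no convergence check, to keep the sketch short).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, with beta = 0 at the start
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # column is constant after centering
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coefficient j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def selection_frequency(X, y, lam=0.5, n_resamples=30, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n, replace=True)  # bootstrap sample
        Xb = X[idx] - X[idx].mean(axis=0)          # center features
        yb = y[idx] - y[idx].mean()                # center response
        counts += lasso_cd(Xb, yb, lam) != 0
    return counts / n_resamples


# Simulated "few samples, many features" data (hypothetical, not thesis data):
# only the first three features influence the response.
rng = np.random.default_rng(1)
n, p = 40, 100
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 3 * X[:, 1] + 3 * X[:, 2] + 0.5 * rng.standard_normal(n)

freq = selection_frequency(X, y)
top = np.argsort(-freq)[:3]  # indices of the most frequently selected features
print(sorted(top.tolist()), freq[top])
```

In a two-stage framework of the kind described above, the top-ranked features from such a step would then be passed to the classifiers; how many features to keep is a tuning choice this sketch does not address.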
In the real data application, we combined machine learning with various bioinformatics tools, such as STRINGdb and Cytoscape, to explore 13 sets of microarray and RNA-seq data to identify hub genes in colorectal cancer. The results could be useful for suggesting further studies to reveal potential biomarkers that might lead to better cancer diagnoses and treatments.
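The abstract does not state how hub genes were defined. One common convention treats the most highly connected nodes of a protein-protein interaction network as hubs, and the sketch below ranks nodes by degree under that assumption. The edge list is a hypothetical toy with placeholder gene labels, not STRING output, and the function name `rank_by_degree` is illustrative only.

```python
from collections import Counter


def rank_by_degree(edges):
    """Return (node, degree) pairs sorted by descending degree, then by name.

    Self-loops and duplicate edges (in either orientation) are ignored.
    """
    degree = Counter()
    seen = set()
    for a, b in edges:
        if a == b:
            continue  # skip self-loops
        key = frozenset((a, b))
        if key in seen:
            continue  # skip duplicates such as (G3, G1) after (G1, G3)
        seen.add(key)
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical interaction pairs between placeholder genes G1..G5.
toy_edges = [
    ("G1", "G2"),
    ("G1", "G3"),
    ("G1", "G4"),
    ("G2", "G3"),
    ("G4", "G5"),
    ("G3", "G1"),  # duplicate of (G1, G3), ignored
]

ranking = rank_by_degree(toy_edges)
hubs = [gene for gene, _ in ranking[:2]]  # keep the two best-connected nodes
print(ranking, hubs)
```

Degree is only one of several centrality measures used for this purpose, and the cutoff for calling a node a hub is a judgment call.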

Subject Area

Bioinformatics|Applied Mathematics|Biostatistics|Computer science

Recommended Citation

Patil, Abhijeet R, "Gene Selection and Classification in High-Throughput Biological Data With Integrated Machine Learning Algorithms and Bioinformatics Approaches" (2021). ETD Collection for University of Texas, El Paso. AAI28496302.