Date of Award

2025-05-01

Degree Name

Doctor of Philosophy

Department

Mathematical Sciences

Advisor(s)

Jonathon Mohl

Abstract

Machine learning applied across multiple domains has changed how we understand and predict complex biological phenomena. This dissertation presents novel computational methodologies addressing two problems: mallard classification using single-nucleotide polymorphisms (SNPs), and protein function prediction via interpretable, topic-aware peptide embeddings. The first study distinguishes mallard populations using SNP data, which are inherently ultra-high-dimensional. It combines feature-selection and dimensionality-reduction strategies with machine learning classification algorithms to identify minimal yet highly predictive SNP sets for accurate breed differentiation. This framework demonstrates robust performance and computational efficiency, aiding conservation and breed management efforts. The second study applies natural language processing techniques to biological sequences: protein sequences are split by enzyme-based fragmentation (e.g., trypsin digestion), and the resulting peptides are embedded with Word2Vec models. Topic modeling (BERTopic) of these peptide embeddings supports functional classification (Gene Ontology term prediction), achieving ROC-AUC scores comparable to full-sequence models (98.9% vs. 99%). Notably, topic-derived peptides frequently align with known functional motifs, including ligand-binding sites, underscoring their biological significance and interpretability. Together, these studies illustrate how machine learning can handle diverse biological datasets, providing accurate predictive models and interpretable insights for practical biological discovery and decision-making.
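
To make the fragmentation step in the abstract concrete, here is a minimal sketch of trypsin-style digestion of a protein sequence. It assumes the commonly cited cleavage rule (cut after K or R unless the next residue is P); the function name `trypsin_digest`, the `min_length` parameter, and the example sequence are illustrative and are not taken from the dissertation, whose actual pipeline may differ.

```python
def trypsin_digest(sequence, min_length=1):
    """Split a protein sequence into peptides using a simple trypsin rule.

    Assumed rule: cleave after lysine (K) or arginine (R),
    except when the following residue is proline (P).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep whatever remains after the last cleavage site.
    if start < len(sequence):
        peptides.append(sequence[start:])
    # Optionally drop very short fragments.
    return [p for p in peptides if len(p) >= min_length]


# Hypothetical example sequence; the R at position 7 is followed by P, so it is not cut.
peptides = trypsin_digest("MKWVTFRPSLLKAA")
print(peptides)
```

Each protein then becomes a list of peptide "words", which is the kind of tokenized input a Word2Vec-style embedding model expects.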

Language

en

Provenance

Received from ProQuest

File Size

87 p.

File Format

application/pdf

Rights Holder

Tolulope Samuel Adeyina
