Date of Award

2025-12-01

Degree Name

Doctor of Philosophy

Department

Computer Science

Advisor(s)

Christopher Kiekintveld

Second Advisor

Aritran Piplai

Abstract

State-of-the-art machine learning and deep learning models generally perform well on previously seen data, but they rest on the incorrect closed-world assumption that all real-world data resemble the training and validation samples; hence their poor performance when exposed to data that deviate from the training and validation sets. This is clearly evident in cybersecurity, where the world continues to experience high-profile malware attacks despite advances in state-of-the-art research. The constant evolution of the tools and methods used to carry out attacks has given hackers and other cybercriminals significant leverage, enabling them to carry out more intelligent and robust attacks because (i) the evolution of such tools makes it easier to create new malware variants; (ii) the rate at which new malware variants are developed significantly outpaces state-of-the-art research, with an average of over 1,500 brand-new malware variants created daily according to SonicWall statistics; and (iii) cybercriminals are aware of the vulnerability of current state-of-the-art machine learning and deep learning models to new malware variants.

To address the vulnerability of state-of-the-art machine learning and deep learning approaches to the out-of-distribution problem, several approaches, such as adversarial training, input transformation, self-adaptive training, adversarial purification, and zero-shot, one-shot, and few-shot learning, have been proposed and applied to an array of benchmark datasets in various research domains, but none had been applied to an actual out-of-distribution malware attack problem. During our initial investigation, we implemented these approaches on four benchmark malware datasets in an out-of-distribution setting, and all of them performed poorly. This leads to our assertion that the poor performance of current state-of-the-art approaches on out-of-distribution malware classification is linked to the variation among malware variants within the same family, which sets malware data apart from datasets in other domains. Current state-of-the-art out-of-distribution approaches do not address this intra-family variation in dynamic and static behavior, as evidenced by the dismal performance of such models when exposed to out-of-distribution malware.

We propose a two-stage framework that addresses this limitation by incorporating Gaussian discriminant embeddings into deep neural networks to model spherical decision boundaries around malware families in the embedding space. The first stage employs unsupervised cluster analysis to determine whether a test sample is in-distribution or out-of-distribution, using z-score-based statistical analysis for reliable outlier detection. The second stage introduces a deep learning model trained on refined embeddings from the first stage, using predictions from both the cluster analysis and a primary classifier to improve final prediction accuracy. Evaluation on a dataset comprising 25 malware families and novel OOD samples demonstrates superior performance over softmax-confidence and Mahalanobis-distance baselines, achieving an AUC of 0.911 for OOD detection. This approach significantly improves the distinguishability of OOD samples and offers a scalable, statistically grounded method for robust malware classification and anomaly detection in cybersecurity contexts.

We address the problem of intra-family variation within the same malware family by (1) exploiting the in-dimensional embedding space among variants of the same malware family to account for their variation; (2) exploiting the inter-dimensional space between different malware families; (3) building, from scratch, a deep learning-based model with a shallow neural network of at most two connected layers to limit overfitting; and (4) building a Bayesian-inference-based algorithm that is coupled with the connected network and can create new data points and adjust existing ones when exposed to out-of-distribution variants of an existing family or to a new malware family, which determines the extent to which the weights should be adjusted, thereby triggering the gradient.
Finally, we evaluate our approach using various statistical measures and compare it with various baselines.
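
As a rough illustration of the z-score-based in-distribution/out-of-distribution check described for the first stage, the following Python sketch scores a test embedding by its distance to each family centroid. It is not the dissertation's implementation: the centroid-distance statistic, the function names, the synthetic two-dimensional data, and the threshold of 3.0 are all assumptions made for illustration.

```python
import numpy as np


def fit_family_stats(embeddings, labels):
    """For each family, store its centroid and the mean and standard
    deviation of its members' distances to that centroid."""
    stats = {}
    for family in np.unique(labels):
        members = embeddings[labels == family]
        centroid = members.mean(axis=0)
        dists = np.linalg.norm(members - centroid, axis=1)
        # Small constant avoids division by zero for degenerate clusters.
        stats[family] = (centroid, dists.mean(), dists.std() + 1e-12)
    return stats


def ood_z_score(x, stats):
    """Return (family, z) for the family whose distance z-score is smallest."""
    best_family, best_z = None, np.inf
    for family, (centroid, mu, sigma) in stats.items():
        z = (np.linalg.norm(x - centroid) - mu) / sigma
        if z < best_z:
            best_family, best_z = family, z
    return best_family, best_z


def is_ood(x, stats, threshold=3.0):
    """Flag x as out-of-distribution when it lies more than `threshold`
    standard deviations beyond the typical radius of every family.
    The threshold value is a hypothetical choice."""
    _, z = ood_z_score(x, stats)
    return bool(z > threshold)


# Synthetic stand-in for learned embeddings: two well-separated families.
rng = np.random.default_rng(0)
train = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(10.0, 1.0, size=(200, 2)),
])
labels = np.repeat([0, 1], 200)
stats = fit_family_stats(train, labels)

print(is_ood(np.array([0.1, 0.1]), stats))     # point near family 0
print(is_ood(np.array([50.0, -50.0]), stats))  # point far from both families
```

Using one radius statistic per family corresponds to a spherical boundary around each centroid; a sample is treated as in-distribution if it falls inside at least one such boundary.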

Language

en

Provenance

Received from ProQuest

File Size

154 p.

File Format

application/pdf

Rights Holder

Tosin Olusola Ige
