Date of Award
2025-05-01
Degree Name
Master of Science
Department
Computer Science
Advisor(s)
Aritran Piplai
Abstract
The rapid rise of sophisticated malware variants poses significant challenges for cybersecurity analysts, particularly due to the scarcity of data on newly emerging threats. Because of privacy, legal, and operational constraints, malware samples are often not shareable; instead, organizations publish cyber threat intelligence (CTI) in natural language. However, these reports are typically unstructured and inconsistent, limiting their utility in machine learning (ML) models. This thesis explores whether high-fidelity, shareable threat intelligence can be automatically generated from structured malware behaviors to supplement ML models when direct access to malware samples is limited. Two central questions are addressed: (i) How can descriptions be made both representative and shareable, avoiding personally identifiable information (PII) or sensitive traits? (ii) Can these descriptions support downstream tasks such as few-shot malware classification in low-data conditions? To address these questions, a distance-aware contrastive loss improves alignment between behavioral data and text, while a privacy-aware penalty reduces sensitive content. The generated descriptions are used in a Model-Agnostic Meta-Learning (MAML) pipeline, with distilled knowledge improving downstream performance. Evaluations on CIC-AndMal-2020 and BODMAS show up to 42% improvement over pre-trained LLMs in few-shot classification, and a 10–20% gain through multimodal fusion. Gains are also reflected in semantic metrics such as RAGAS Answer Correctness and Similarity. By enabling the automated generation of task-relevant CTI, this work facilitates secure sharing of anonymized behavioral profiles, thereby advancing collaborative threat detection, improving integration into real-world security systems, and empowering organizations to crowdsource effective defense strategies against emerging threats.
Language
en
Provenance
Received from ProQuest
Copyright Date
2025-05
File Size
63 p.
File Format
application/pdf
Rights Holder
Ivan Alejandro Montoya Sanchez
Recommended Citation
Montoya Sanchez, Ivan Alejandro, "From Text to Utility: Distance-Aware Contrastive Learning for Detection-Ready and Shareable Malware Descriptions" (2025). Open Access Theses & Dissertations. 4418.
https://scholarworks.utep.edu/open_etd/4418