Date of Award
2025-12-01
Degree Name
Doctor of Philosophy
Department
Data Science
Advisor(s)
Son-Young Yi
Abstract
High-cardinality categorical variables remain difficult to model in tabular data, where classical encoders suffer from sparsity, susceptibility to leakage, and the loss of meaningful relational structure. This dissertation develops a unified framework for learning, evaluating, and synthesizing representations of such variables using both traditional encoders and modern embedding methods, including Word2Vec, FastText, Node2Vec, TF-IDF/SVD, and supervised entity embeddings. The framework is applied across three benchmark datasets (Adult, PetFinder, Breast Cancer) and a hierarchical educational case study (IPEDS/CIP). Embedding quality is examined through both downstream predictive performance and structure-focused diagnostics that quantify neighborhood behavior and geometric coherence. To assess whether synthetic data can reproduce these learned relationships, multiple generative models (CTGAN, TVAE, and LLM-based tabular synthesis) are evaluated under Train-on-Synthetic/Test-on-Real (TSTR) and Train-on-Real/Test-on-Synthetic (TRTS) protocols. Across these studies, patterns emerge regarding when learned embeddings offer advantages over classical encodings, how different embedding families capture distinct forms of categorical structure, and the extent to which synthetic data methods recover these properties. The dissertation ultimately provides a systematic methodology for representing and synthesizing high-cardinality categorical variables, with implications for privacy-preserving analytics, large-scale educational research, and future work in structure-aware generative modeling.
Language
en
Provenance
Received from ProQuest
Copyright Date
2025-12
File Size
121 p.
File Format
application/pdf
Rights Holder
Cesar Iram Vazquez
Recommended Citation
Vazquez, Cesar Iram, "A Unified Framework For Embedding-Based Synthetic Data Generation With High Cardinality Categorical Features" (2025). Open Access Theses & Dissertations. 4602.
https://scholarworks.utep.edu/open_etd/4602