Date of Award

2025-12-01

Degree Name

Doctor of Philosophy

Department

Data Science

Advisor(s)

Son-Young Yi

Abstract

High-cardinality categorical variables remain difficult to model in tabular data, where classical encoders encounter sparsity, susceptibility to leakage, and the loss of meaningful relational structure. This dissertation develops a unified framework for learning, evaluating, and synthesizing representations of such variables using both traditional encoders and modern embedding methods, including Word2Vec, FastText, Node2Vec, TF-IDF/SVD, and supervised entity embeddings. The framework is applied across three benchmark datasets (Adult, PetFinder, Breast Cancer) and a hierarchical educational case study (IPEDS/CIP). Embedding quality is examined through both downstream predictive performance and structure-focused diagnostics that quantify neighborhood behavior and geometric coherence. To assess whether synthetic data can reproduce these learned relationships, multiple generative models (CTGAN, TVAE, and LLM-based tabular synthesis) are evaluated under Train-on-Synthetic/Test-on-Real (TSTR) and Train-on-Real/Test-on-Synthetic (TRTS) protocols. Across these studies, patterns emerge regarding when learned embeddings offer advantages over classical encodings, how different embedding families capture distinct forms of categorical structure, and the extent to which synthetic data methods recover these properties. The dissertation ultimately provides a systematic methodology for representing and synthesizing high-cardinality categorical variables, with implications for privacy-preserving analytics, large-scale educational research, and future work in structure-aware generative modeling.

Language

en

Provenance

Received from ProQuest

File Size

121 p.

File Format

application/pdf

Rights Holder

Cesar Iram Vazquez
