Open Access Theses & Dissertations

A Unified Framework For Embedding-Based Synthetic Data Generation With High Cardinality Categorical Features

Cesar Iram Vazquez, University of Texas at El Paso

Date of Award

2025-12-01

Degree Name

Doctor of Philosophy

Department

Data Science

Advisor(s)

Son-Young Yi

Abstract

High-cardinality categorical variables remain difficult to model in tabular data, where classical encoders encounter sparsity, susceptibility to leakage, and the loss of meaningful relational structure. This dissertation develops a unified framework for learning, evaluating, and synthesizing representations of such variables using both traditional encoders and modern embedding methods, including Word2Vec, FastText, Node2Vec, TF-IDF/SVD, and supervised entity embeddings. The framework is applied across three benchmark datasets (Adult, PetFinder, Breast Cancer) and a hierarchical educational case study (IPEDS/CIP). Embedding quality is examined through both downstream predictive performance and structure-focused diagnostics that quantify neighborhood behavior and geometric coherence. To assess whether synthetic data can reproduce these learned relationships, multiple generative models (CTGAN, TVAE, and LLM-based tabular synthesis) are evaluated under Train-on-Synthetic/Test-on-Real (TSTR) and Train-on-Real/Test-on-Synthetic (TRTS) protocols. Across these studies, patterns emerge regarding when learned embeddings offer advantages over classical encodings, how different embedding families capture distinct forms of categorical structure, and the extent to which synthetic data methods recover these properties. The dissertation ultimately provides a systematic methodology for representing and synthesizing high-cardinality categorical variables, with implications for privacy-preserving analytics, large-scale educational research, and future work in structure-aware generative modeling.
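
The TSTR and TRTS protocols named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the toy rows, the helper names (fit_majority, accuracy, tstr, trts), and the stand-in classifier, which simply predicts the most common label for each category value. In practice the synthetic rows would come from a generator such as CTGAN or TVAE, and the classifier would be a real downstream model.

```python
from collections import Counter, defaultdict


def fit_majority(rows):
    """Fit a toy classifier: the most common label seen for each category value."""
    by_category = defaultdict(Counter)
    overall = Counter()
    for category, label in rows:
        by_category[category][label] += 1
        overall[label] += 1
    # Fallback label for category values that never appeared in training.
    default = overall.most_common(1)[0][0]
    table = {cat: counts.most_common(1)[0][0] for cat, counts in by_category.items()}
    return table, default


def accuracy(model, rows):
    """Fraction of rows whose label matches the model's prediction."""
    table, default = model
    hits = sum(table.get(category, default) == label for category, label in rows)
    return hits / len(rows)


def tstr(real_test, synthetic):
    """Train-on-Synthetic / Test-on-Real: fit on synthetic rows, score on real rows."""
    return accuracy(fit_majority(synthetic), real_test)


def trts(real_train, synthetic):
    """Train-on-Real / Test-on-Synthetic: fit on real rows, score on synthetic rows."""
    return accuracy(fit_majority(real_train), synthetic)


# Hypothetical toy data: (category value, binary label) pairs.
real_train = [("a", 1), ("a", 1), ("b", 0), ("b", 0), ("c", 1)]
real_test = [("a", 1), ("b", 0), ("c", 1), ("c", 0)]
synthetic = [("a", 1), ("b", 0), ("b", 0), ("c", 0)]

tstr_score = tstr(real_test, synthetic)
trts_score = trts(real_train, synthetic)
print(f"TSTR accuracy: {tstr_score:.2f}")
print(f"TRTS accuracy: {trts_score:.2f}")
```

Under this reading, a TSTR score close to the train-on-real baseline suggests the synthetic rows preserve the label-relevant structure of the real data, while TRTS checks whether a model fit on real data finds the synthetic rows plausible.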

Provenance

Received from ProQuest

Copyright Date

2025-12

File Size

121 p.

File Format

application/pdf

Rights Holder

Cesar Iram Vazquez

Recommended Citation

Vazquez, Cesar Iram, "A Unified Framework For Embedding-Based Synthetic Data Generation With High Cardinality Categorical Features" (2025). Open Access Theses & Dissertations. 4602.
https://scholarworks.utep.edu/open_etd/4602

Included in

Computer Sciences Commons, Statistics and Probability Commons
