Matroid - Based Variable Selection for Complex Data Structures
Abstract
This research project extends the matroid algorithm with two statistically based criteria, joint (multivariate) cumulants (Speed, 1983) and effective dependence (Pena & Rodriguez, 2003), so that it captures linear as well as non-linear higher-order dependencies. We also use the proposed matroid algorithm to improve variable selection for complex data structures. The limiting distribution of the joint cumulant is obtained from the U-statistics theory of Hoeffding (1948). The U-statistic variance given by Hoeffding provides a lower bound for the estimated variance, and our simulation results justify using it to set a threshold on the deviation of a joint cumulant from zero. We use the definition of dependent sets given in Greene (1990) to define the matroid. The algorithm ultimately identifies the maximal sets of covariates that can be represented by a j-dimensional projection; these sets are known as flats. We also use the effective dependence theory proposed by Pena and Rodriguez (2003) to compare groups with different numbers of variables. The effective rank of a flat provides an estimate of the number of variables to choose from that flat. Simulation studies assess variable selection under the matroid approach compared with traditional variable selection methods for different parameter values and sample sizes. We present two illustrative examples using Fano and Induced Collinearity structures, followed by applications to real data and some concluding remarks.
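The two dependence criteria named in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes the standard third-order joint cumulant, E[(X - EX)(Y - EY)(Z - EZ)], estimated by a simple plug-in average, and it assumes that effective dependence takes the form 1 - |R|^(1/p) for a p x p correlation matrix R, which is our reading of Pena and Rodriguez (2003). The function names and the simulated data are hypothetical, and the Hoeffding-based threshold described in the abstract is not reproduced here.

```python
import numpy as np

def joint_cumulant3(x, y, z):
    """Plug-in estimate of the third-order joint cumulant (hypothetical helper)."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    return np.mean((x - x.mean()) * (y - y.mean()) * (z - z.mean()))

def effective_dependence(data):
    """Assumed form 1 - |R|^(1/p), with R the sample correlation matrix of the columns."""
    r = np.corrcoef(data, rowvar=False)
    p = r.shape[0]
    det = max(np.linalg.det(r), 0.0)  # guard against tiny negative round-off
    return 1.0 - det ** (1.0 / p)

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# z_dep depends on (x, y) only through their product, so its pairwise
# correlations with x and y are near zero; z_ind is independent noise.
z_dep = x * y + 0.1 * rng.standard_normal(n)
z_ind = rng.standard_normal(n)

k_dep = joint_cumulant3(x, y, z_dep)  # population value is E[x^2 y^2] = 1
k_ind = joint_cumulant3(x, y, z_ind)  # population value is 0

# A nearly collinear triple versus three independent columns.
collinear = np.column_stack([x, y, x + y + 0.01 * rng.standard_normal(n)])
independent = np.column_stack([x, y, z_ind])

de_collinear = effective_dependence(collinear)      # close to 1
de_independent = effective_dependence(independent)  # close to 0

print(k_dep, k_ind, de_collinear, de_independent)
```

In this simulated setting the joint cumulant picks up the product dependence that pairwise correlation misses, while the assumed effective dependence measure separates the collinear group from the independent one and is normalized by the number of variables p.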
Subject Area
Statistics
Recommended Citation
Jayanetti, Wimarsha Thathsarani, "Matroid - Based Variable Selection for Complex Data Structures" (2018). ETD Collection for University of Texas, El Paso. AAI10841321.
https://scholarworks.utep.edu/dissertations/AAI10841321