High-Dimensional Random Forests

Roland Fiagbe, University of Texas at El Paso

Abstract

Advances in technology have made it easy to collect and manage high-dimensional data in many fields; however, modeling these data remains a major problem in data science. High dimensionality is one of the main challenges that degrade the performance and precision of most classification and regression algorithms, including random forests. Random forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance and precision, like those of other methods, are strongly affected by high dimensions, especially when the dataset contains a large number of noise or noninformative features. It is known in the literature that data dominated by uninformative features have a small expected number of informative variables, which makes it difficult to obtain an accurate or robust random forest model. In this study, we present a new algorithm that uses ridge regression as a variable screening tool to identify informative features in high-dimensional settings and then applies the classical random forest to a top portion of the selected important features. Simulation studies in high dimensions are carried out to test how the proposed method addresses this problem and improves the performance of random forest models. To illustrate the method, we apply it to a real-life dataset, the Communities and Crime dataset from the UCI database. The results show that variable screening using ridge regression can be a very useful tool for building high-dimensional random forests.
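The screen-then-fit idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes scikit-learn and NumPy are available, assumes features are ranked by the absolute value of their ridge coefficients after standardization, and uses a regression forest on synthetic data. The function names `ridge_screen` and `screened_forest`, the `keep_fraction` and `alpha` parameters, and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def ridge_screen(X, y, keep_fraction=0.1, alpha=1.0):
    """Return sorted column indices of the top-ranked features.

    Assumption: importance is measured by the absolute ridge coefficient
    after standardizing each column, so coefficients are comparable.
    """
    X_std = StandardScaler().fit_transform(X)
    coef = Ridge(alpha=alpha).fit(X_std, y).coef_
    n_keep = max(1, int(round(keep_fraction * X.shape[1])))
    order = np.argsort(-np.abs(coef))  # largest |coefficient| first
    return np.sort(order[:n_keep])


def screened_forest(X, y, keep_fraction=0.1, alpha=1.0, seed=0):
    """Screen features with ridge, then fit a classical random forest."""
    selected = ridge_screen(X, y, keep_fraction=keep_fraction, alpha=alpha)
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X[:, selected], y)
    return selected, forest


# Synthetic high-dimensional data (hypothetical): 500 features, 3 informative.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 2 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Keep the top 5% of features (25 columns) and fit the forest on them.
selected, forest = screened_forest(X, y, keep_fraction=0.05)
predictions = forest.predict(X[:, selected])
```

In practice the ridge penalty and the retained fraction would be tuned, for example by cross-validation, and the screening would be fit on training data only so that held-out error estimates are not biased by the selection step.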

Subject Area

Statistics

Recommended Citation

Fiagbe, Roland, "High-Dimensional Random Forests" (2021). ETD Collection for University of Texas, El Paso. AAI28499471.
https://scholarworks.utep.edu/dissertations/AAI28499471
