Date of Award

2021-05-01

Degree Name

Master of Science

Department

Mathematical Sciences

Advisor(s)

Xiaogang X. Su

Abstract

Advances in technology have made it easy to collect and manage high-dimensional data in many fields; however, modeling these data remains a major problem in data science. High dimensionality is one of the significant challenges that degrade the performance and precision of most classification and regression algorithms, including random forests. Random forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance and precision, like those of other methods, are strongly affected by high dimensions, especially when the dataset contains a large number of noise or noninformative features. It is known in the literature that data dominated by uninformative features have a small expected number of informative variables, which makes it difficult to obtain an accurate or robust random forest model.

In this study, we present a new algorithm that uses ridge regression as a variable screening tool to identify informative features in high-dimensional settings and then applies the classical random forest to a top portion of the selected important features. Simulation studies in high dimensions are carried out to test how our proposed method addresses the above problem and improves the performance of random forest models. To illustrate our method, we apply it to a real-life dataset (the Communities and Crime Dataset), which was sourced from the UCI database. The results show that variable screening using ridge regression can be a very useful tool for building high-dimensional random forests.
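
The screen-then-forest idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual algorithm: it assumes a regression task, ranks features by the absolute value of their ridge coefficients on standardized inputs, and uses scikit-learn. The function name `ridge_screen_then_forest`, the parameters `top_fraction` and `alpha`, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def ridge_screen_then_forest(X, y, top_fraction=0.1, alpha=1.0, seed=0):
    """Rank features by |ridge coefficient|, keep the top fraction, fit a forest.

    Assumption: absolute ridge coefficients serve as the screening score.
    """
    # Standardize so coefficient magnitudes are comparable across features.
    X_std = StandardScaler().fit_transform(X)
    ridge = Ridge(alpha=alpha).fit(X_std, y)

    # Number of features to retain (at least one).
    n_keep = max(1, int(round(top_fraction * X.shape[1])))
    # Indices of the n_keep largest absolute coefficients.
    keep = np.argsort(-np.abs(ridge.coef_))[:n_keep]

    # Fit a standard random forest on the screened features only.
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X[:, keep], y)
    return keep, forest


# Synthetic data for illustration: 5 informative features among 500.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
signal = X[:, :5] @ np.array([3.0, -2.0, 2.0, -3.0, 2.5])
y = signal + rng.normal(scale=0.5, size=200)

keep, forest = ridge_screen_then_forest(X, y, top_fraction=0.05)
print(sorted(keep.tolist()))
```

In practice the ridge penalty `alpha` and the retained fraction would need tuning, for example by cross-validation; the values here are placeholders.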

Language

en

Provenance

Received from ProQuest

File Size

77 p.

File Format

application/pdf

Rights Holder

Roland Fiagbe
