Date of Award

2018-01-01

Degree Name

Master of Science

Department

Mathematical Sciences

Advisor(s)

Xiaogang Su

Abstract

Binary classification is one of the main themes of supervised learning. This research is concerned with determining the optimal cutoff point for the continuous-scaled outcomes (e.g., predicted probabilities) produced by a classifier such as logistic regression. We note that the cutoff point obtained from any of the various methods is a statistic, which can be unstable and subject to substantial variation. Nevertheless, due partly to the complexity involved in estimating the cutpoint, there has been no formal study of the variance or standard error of the estimated cutoff point.

In this thesis, a bootstrap aggregation method is put forward to estimate the optimal cutoff point c. In our approach, the out-of-bag samples provide a natural way of performing cross validation when computing the predicted probabilities. The ensemble learning helps reduce the variation in the estimated cutoff point. Furthermore, we are able to compute the standard error of the estimated cutoff point conveniently via the infinitesimal jackknife method, which is a by-product of bootstrap aggregation and adds little to the computational cost. Accordingly, valid confidence intervals for c can be constructed.
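The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a bagged cutoff and its infinitesimal jackknife standard error might be computed. Several choices here are assumptions made for illustration and are not taken from the thesis: a single predictor, a hand-written Newton-Raphson logistic fit, Youden's J as the cutoff criterion, and the function names `fit_logistic`, `youden_cutoff`, `bagged_cutoff`, and `simulate`. The variance uses the bias-corrected infinitesimal jackknife form commonly applied to bagged estimators; whether the thesis uses exactly this form is not stated. The `use_oob` flag switches between evaluating each replicate's cutoff on out-of-bag or in-bag rows.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson fit of a one-predictor logistic model; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        h00 = h11 = 1e-8  # tiny ridge keeps the 2x2 Hessian invertible
        h01 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:  # near-separated sample: stop updating
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-8:
            break
    return b0, b1


def youden_cutoff(probs, ys):
    """Cutoff maximizing sensitivity + specificity - 1 (assumed criterion)."""
    n_pos = sum(ys)
    n_neg = len(ys) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5  # fallback when one class is absent
    best_c, best_j = 0.5, -1.0
    for c in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, ys) if p >= c and y == 1)
        tn = sum(1 for p, y in zip(probs, ys) if p < c and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_c, best_j = c, j
    return best_c


def bagged_cutoff(xs, ys, B=2000, use_oob=True, seed=0):
    """Return (bagged cutoff, bias-corrected IJ standard error)."""
    rng = random.Random(seed)
    n = len(xs)
    cuts, counts = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        N = [0] * n  # how often each observation appears in this resample
        for i in idx:
            N[i] += 1
        b0, b1 = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        # rows used to pick the cutoff: out-of-bag (N_i == 0) or in-bag
        ev = [i for i in range(n) if N[i] == 0] if use_oob else idx
        probs = [sigmoid(b0 + b1 * xs[i]) for i in ev]
        cuts.append(youden_cutoff(probs, [ys[i] for i in ev]))
        counts.append(N)
    c_bar = sum(cuts) / B
    dev = [c - c_bar for c in cuts]
    # IJ variance: sum over i of cov_i^2, cov_i = mean_b (N_bi - 1)(c_b - c_bar)
    v_ij = 0.0
    for i in range(n):
        cov_i = sum((counts[b][i] - 1) * dev[b] for b in range(B)) / B
        v_ij += cov_i ** 2
    # Monte Carlo bias correction; can go negative for small B, so clamp at 0
    v_ij -= n / B ** 2 * sum(d * d for d in dev)
    return c_bar, math.sqrt(max(v_ij, 0.0))


def simulate(n, seed=1):
    """Toy data from a logistic model with hypothetical coefficients."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(-0.5 + 1.5 * x) else 0 for x in xs]
    return xs, ys
```

As a usage example, `xs, ys = simulate(200)` followed by `c_hat, se = bagged_cutoff(xs, ys, B=2000)` gives a point estimate and a standard error, from which a normal-approximation interval such as `c_hat ± 1.96 * se` could be formed. Calling it with `use_oob=False` allows the two evaluation schemes to be compared on the same data.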

Extensive simulation studies are conducted to investigate the performance of the proposed method and to gain experience with its usage. Throughout the research, our focus is restricted to logistic regression. While the bootstrap aggregation method yields valid and promising results in general, a number of interesting observations and useful conclusions are drawn from the simulated experiments. First, bias correction is not optional but necessary. Second, the number of bootstrap samples, B, needs to be large (about 2,000) to yield valid results. On the other hand, out-of-bag samples do not work for the SE computation.

For an empirical illustration, we apply our proposed method to a viral news detection study. Logistic regression analysis yields meaningful findings and accurate predictions with an appropriate choice of the cutoff point. Fancy topics, subjectivity, and negative words can attract more clicks and shares. We also show that the proposed bootstrap aggregation method is quite reliable and effective.

Language

en

Provenance

Received from ProQuest

File Size

44 pages

File Format

application/pdf

Rights Holder

Zheng Zhang
