Date of Award
Master of Science
Maria C. Mariani
This work investigates the predictive performance of 10 machine learning models on three medical datasets: breast cancer, heart disease, and prostate cancer. Furthermore, we use the models to identify risk factors that contribute significantly to these diseases.
The models considered are: logistic regression with L1 and L2 penalties (LR-Lasso and LR-Ridge), principal component logistic regression (PCR-LR), partial least squares logistic regression (PLS-LR), multivariate adaptive regression splines (MARS), support vector machine with radial basis kernel (SVM-RBK), random forest (RF), gradient boosting machines (GBM), elastic net (Enet), and feedforward neural network (FFNN). The models were grouped according to their similarities and learning style: i) linear regularized models: LR-Lasso, LR-Ridge, and LR-Enet; ii) linear dimension reduction models: PCR-LR and PLS-LR; iii) non-linear ensemble models: RF and GBM; iv) other non-linear models: FFNN, SVM-RBK, and MARS. In all the applications, the methodology provides insight into the predictive performance of these models and the risk factors of these diseases.
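For orientation, the three linear regularized models differ only in the penalty added to the logistic log-likelihood. The following is a sketch under a common (glmnet-style) parametrization; the symbols \lambda, \alpha, \beta_0, and \beta are notation assumed here for illustration and are not taken from the thesis.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr]
  + \lambda\, P_\alpha(\beta),
\qquad \eta_i = \beta_0 + x_i^{\top}\beta,

P_\alpha(\beta) = \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\, \lVert \beta \rVert_2^2 .
```

Under this convention, \alpha = 1 gives the L1 (lasso) penalty, \alpha = 0 gives the L2 (ridge) penalty, and 0 < \alpha < 1 gives the elastic net, with \lambda \ge 0 controlling the overall amount of shrinkage.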
Model selection and hyperparameter tuning were guided by the bias-variance tradeoff and cross-validation. To avoid overfitting, the performance and generalization of each method were improved by applying early stopping and dropout and by removing non-significant variables. The models were compared using several predictive performance measures, including prediction accuracy, sensitivity, and specificity, chosen according to whether the response distribution was balanced or imbalanced.
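The three performance measures named above can be computed directly from the counts of a binary confusion matrix. The following is a minimal sketch, not the thesis's own code; the function name `binary_metrics` and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy, sensitivity and specificity for 0/1 labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Hypothetical imbalanced sample: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a classifier that always predicts "negative"
metrics = binary_metrics(y_true, y_pred)
```

On this toy sample the always-negative classifier reaches an accuracy of 0.8 while its sensitivity is 0, which illustrates why accuracy alone can be misleading when the response is imbalanced.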
The results show that the non-linear models SVM-RBK, RF, and FFNN gave the best predictive performance for the breast cancer data. The linear models LR-PLS, LR-PC, LR-Ridge, and LR-Enet were preferred for heart disease, and a mixture of linear and non-linear models (LR-Lasso, LR-Enet, RF, and GBM) gave the best predictions for prostate cancer.
Received from ProQuest
Biney, Francis, "Comparing predictive performance of statistical learning models on Medical data" (2020). Open Access Theses & Dissertations. 3083.