Prediction and Classification of G Protein-coupled Receptors Using Statistical and Machine Learning Algorithms
G protein-coupled receptors (GPCRs) are transmembrane proteins with important functions in signal transduction and often serve as therapeutic drug targets. With the increasing availability of protein sequence information, there is much interest in computationally predicting GPCRs and classifying them to indicate their possible biological roles. Such predictions are cost-efficient and can be valuable guides for designing wet lab experiments to help elucidate signaling pathways and expedite drug discovery. Existing computational tools for GPCR prediction and classification rely on statistical and machine learning approaches such as principal component analysis, support vector machines, and hidden Markov models. These tools use sequence-derived features, including amino acid and dipeptide compositions as well as autocorrelation descriptors of physicochemical properties. While prediction accuracies of over 90% were generally reported on their own test data, no direct performance comparison of the different tools has been conducted using a unified test dataset. Furthermore, their ability to distinguish GPCRs from transmembrane non-GPCRs has not been measured, and none of the existing tools is capable of fully classifying a general GPCR down to the subtype level. In this dissertation, I proposed two new methods, the penalized multinomial logistic regression (Log-Reg) algorithm and the multi-layer perceptron neural network (MLP-NN), to address this multilayer problem of GPCR prediction and classification using 1360 sequence features. Training and testing were conducted uniformly with a test dataset containing 2016 confirmed GPCRs and 3100 negative examples, including transmembrane non-GPCRs. To assess the new methods, their performance was compared with that of two available tools, GPCRpred and PCA-GPCR. Both Log-Reg and MLP-NN substantially reduced the false positive rates in distinguishing GPCRs from transmembrane non-GPCRs.
They also produced highly accurate GPCR classification results down to the subtype level, with average accuracies in the 96-99% range. Furthermore, I applied feature reduction techniques to generate a non-redundant feature set that increases the computational efficiency of Log-Reg and MLP-NN with little impact on accuracy. These algorithms have been implemented as Python programs and are being incorporated into the web server gpcr.utep.edu, which can be accessed by GPCR researchers worldwide.
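As an illustration of the sequence-derived features mentioned above, the following minimal sketch computes amino acid and dipeptide compositions for a single protein sequence. It is not the dissertation's implementation: the function names, the handling of non-standard residues, and the ordering of the output vector are assumptions made for this example, and the 420 values it produces cover only the composition features, not the full 1360-feature set.

```python
from itertools import product

# The 20 standard amino acids in alphabetical one-letter order
# (this ordering is an assumption made for the example).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids: "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 values)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair in seq (400 values).

    Pairs containing a non-standard residue (for example 'X') are skipped,
    which is a simplification chosen for this sketch.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def composition_features(seq):
    """Concatenate both compositions into one 420-value feature vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


if __name__ == "__main__":
    # Hypothetical short peptide, used only to show the call.
    features = composition_features("MKTAYIAKQR")
    print(len(features))
```

Because each composition is normalized by the number of residues or residue pairs counted, sequences of different lengths map to vectors of the same length, which is what allows a classifier such as logistic regression or a multi-layer perceptron to take them as input.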
Ayivor, Fredrick, "Prediction and Classification of G Protein-coupled Receptors Using Statistical and Machine Learning Algorithms" (2020). ETD Collection for University of Texas, El Paso. AAI27999432.