Using word embeddings for text classification in positive and unlabeled learning
Machine learning is a subfield of artificial intelligence that aims to build algorithms that improve automatically through experience. It has been applied successfully to problems ranging from playing checkers to predicting the next word as a sentence is typed. These algorithms perform best with large amounts of training data: the more labeled data available, the better a machine learning algorithm can recognize patterns. However, the ideal scenario, in which a large amount of labeled data is available for training, does not always occur; in many cases labeling data is both time-consuming and expensive. This scarcity of training data has created interest in algorithms that can work with only a small labeled training set. One such approach is semi-supervised learning, which supplements the small labeled training set with a large set of unlabeled examples. A classical semi-supervised algorithm is Co-Training, which uses a set of positively and negatively labeled examples to label the unlabeled set through machine learning rather than manually. Co-Training has been applied to problems with little labeled training data, such as web-page classification and image detection. While semi-supervised learning is generally less accurate than fully supervised learning, it offers a practical solution when labeled training data is scarce, and it has advanced work on many problems, such as gene-disease identification in bioinformatics. In this thesis, I focus on one area where this type of learning has shown promise: Positive and Unlabeled (PU) learning for Natural Language Processing. I propose a modification to a PU learning algorithm, Multi-Level Example Learning, that uses word embeddings to improve the original algorithm's text-classification results.
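The Co-Training loop described above can be sketched in a few lines. This is a minimal, illustrative sketch, not the thesis's implementation: it assumes each example has two independent feature "views" (simulated here by splitting synthetic features in half), uses scikit-learn's GaussianNB as the base learner, and the function name `co_train` and its parameters are hypothetical choices for this example.

```python
# Illustrative co-training sketch (assumed setup; not the thesis's algorithm).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled_idx, rounds=5, per_round=10):
    """Grow the labeled set: each view's classifier labels the
    unlabeled examples it is most confident about."""
    labeled = set(labeled_idx)
    unlabeled = set(range(len(y))) - labeled
    y_work = y.astype(float).copy()
    y_work[sorted(unlabeled)] = -1          # hide labels of the "unlabeled" pool
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        idx = sorted(labeled)
        clf1.fit(X1[idx], y_work[idx])      # each classifier sees only its view
        clf2.fit(X2[idx], y_work[idx])
        uidx = sorted(unlabeled)
        for clf, X in ((clf1, X1), (clf2, X2)):
            if not uidx:
                break
            proba = clf.predict_proba(X[uidx])
            top = np.argsort(proba.max(axis=1))[-per_round:]  # most confident
            for t in top:
                i = uidx[t]
                y_work[i] = clf.classes_[proba[t].argmax()]   # pseudo-label
                labeled.add(i)
                unlabeled.discard(i)
            uidx = sorted(unlabeled)
    return clf1, clf2, y_work

# Synthetic data: split 10 features into two 5-feature views.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X1, X2 = X[:, :5], X[:, 5:]
clf1, clf2, y_pseudo = co_train(X1, X2, y, labeled_idx=range(20))
```

The key design point is that the two classifiers never see each other's features, so each can contribute confident pseudo-labels the other could not derive on its own; this is what lets a small labeled seed set grow over the unlabeled pool.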
Tafoya, Emmanuel Carlo, "Using word embeddings for text classification in positive and unlabeled learning" (2016). ETD Collection for University of Texas, El Paso. AAI10250967.