Tracking Topical Evolution in Large Document Collections
Abstract
A large document collection that builds up over time usually contains a number of different themes. Not all of these themes, or topics, are equally important at any given time. A topic might have high probability in some years because of relevant events, and low probability in other years. Analyzing the evolution of such topics has useful applications in a variety of domains, for example, helping researchers quickly see how research topics in an area change, assisting intelligence agents in tracking the activities of a terrorist group, or monitoring the damage caused by a natural disaster. In this dissertation, I present three models that I developed to capture the evolution of topics in dynamic corpora in different domains. First, I present a novel algorithm for finding the lineage of a scientific article. The algorithm provides a unique way of encoding temporal information in a document, which helps it discover more interesting lineages than other state-of-the-art models. Then, I propose a topic model called STEM that accurately extracts high-level themes from a corpus and simultaneously captures the evolutionary patterns of those themes. Topic models have been used for summarizing text corpora for a long time, but STEM is the first model that combines the ideas of supervised inference and topical evolution. In many contexts, political conflicts for instance, topics do not evolve only over time; they also have different degrees of impact in different geolocations. Therefore, I finally developed a new spatiotemporal topic model that can track geopolitical conflicts over the temporal and geographical dimensions. For each of these models, I present the results of qualitative and quantitative analyses on multiple real-world datasets, demonstrating the effectiveness of the model.
Subject Area
Computer science
Recommended Citation
Naim, Sheikh Motahar, "Tracking Topical Evolution in Large Document Collections" (2018). ETD Collection for University of Texas, El Paso. AAI13426753.
https://scholarworks.utep.edu/dissertations/AAI13426753