Time-Reflective Text Representations for Semantic Evolution Tracking and Trend Analytics
Abstract
The extraction of significant, relevant, and useful trends from massive document collections, such as streaming newswire or scientific publications, is a challenging and important problem in many fields, including intelligence analysis, recommendation systems, and scientific research. However, techniques for trend analytics on such large text corpora are limited because research that addresses the temporal nature of these publications is still in its early stages. In this work, we first show that it is possible to capture the evolution of a story (or trend) by connecting the dots between different documents in a text corpus. The observed results indicate that the idea of capturing evolution can be transferred from the story level to a more general language-model level. Thus, we introduce a preliminary time-reflective frequency-based representation, which can capture the semantic evolution of a language model over time while being robust against the uncertainty and noise present in real-world data. This preliminary representation has some shortcomings, including high dimensionality and lack of extensibility, which limit its potential use with trend-analytics techniques. We address these shortcomings by proposing a diffusion-based temporal word embedding model. The proposed technique generates low-dimensional word embeddings that emulate the temporal semantic changes observed in the frequency-based time-reflective representation, making the representation suitable for trend-analytics algorithms. We apply several sequential modeling techniques to the temporal word embeddings to predict what the embedding space will look like in the future based on current trends. We exploit the generated trend models for automatic hypothesis generation by finding potential future relationships between terms.
The applications explored in this work include high-impact areas such as intelligence analysis and cybersecurity, with analyses that study possible associations between malicious entities, and the medical sciences, where the approach can potentially uncover unexplored relations between substances and diseases.
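The last steps described above (forecasting the embedding space from current trends, then looking for term pairs that are predicted to move closer together) can be illustrated with a minimal sketch. It is not the dissertation's diffusion-based model or any of its sequential models: a per-dimension least-squares linear extrapolation is swapped in as the simplest possible stand-in, and the function names (`extrapolate`, `cosine`, `rank_future_pairs`), the term labels, and the toy two-dimensional vectors are all hypothetical, chosen only for illustration.

```python
import math


def extrapolate(series, steps=1):
    """Forecast a vector `steps` slices ahead with a per-dimension linear fit.

    `series` is a list of equal-length vectors observed at t = 0..T-1 (T >= 2).
    Linear extrapolation is an assumption of this sketch, not the source's model.
    """
    n = len(series)
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    future_t = n - 1 + steps
    forecast = []
    for d in range(len(series[0])):
        y = [vec[d] for vec in series]
        y_mean = sum(y) / n
        slope = sum((t - t_mean) * (y[t] - y_mean) for t in range(n)) / denom
        forecast.append(y_mean + slope * (future_t - t_mean))
    return forecast


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_future_pairs(embeddings, steps=1):
    """Rank term pairs by predicted gain in similarity (forecast minus latest)."""
    future = {w: extrapolate(s, steps) for w, s in embeddings.items()}
    words = sorted(embeddings)
    scored = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            now = cosine(embeddings[a][-1], embeddings[b][-1])
            later = cosine(future[a], future[b])
            scored.append((later - now, a, b))
    return sorted(scored, reverse=True)


# Toy embeddings over three time slices; the numbers are made up.
toy = {
    "drug_x": [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]],
    "disease_y": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    "unrelated": [[0.0, -1.0], [0.0, -1.0], [0.0, -1.0]],
}

for gain, a, b in rank_future_pairs(toy):
    print(f"{a} ~ {b}: predicted similarity change {gain:+.3f}")
```

In this toy setting, the pair whose vectors are drifting toward each other ranks first, which is the kind of candidate a hypothesis-generation step would surface for review. A real pipeline would replace the linear fit with a learned sequence model and would need embeddings that are aligned across time slices.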
Subject Area
Computer science|Statistics
Recommended Citation
Camacho Barranco, Roberto, "Time-Reflective Text Representations for Semantic Evolution Tracking and Trend Analytics" (2019). ETD Collection for University of Texas, El Paso. AAI27666295.
https://scholarworks.utep.edu/dissertations/AAI27666295