Date of Award: 2017
Doctor of Philosophy
M. Shahriar Hossain
In this information age, a massive amount of unstructured data takes the form of document collections. Examples include news articles, blog posts, scholarly publications, and reports generated by organizations as well as individuals. Many data mining and machine learning algorithms have been developed in the past decade to support text mining across applications. These applications range from personal information management and organizational decision making to disease control through epidemic prediction and intelligence analysis for national security. One bottleneck cuts across data mining and machine learning approaches in all of these applications: the quality of an algorithm's outcomes depends on the quality of the representation of the documents.
My research exploits the imagery and textual content of documents to create high-quality representations for documents, document tokens (e.g., names of people), and image snippets (e.g., faces of people in news images). My argument is that using both the images and the textual content of a document is crucial in generating its representation, because the author(s) include images to complement the text. In addition, visual objects found in the images of a document sometimes provide contextual information that is missing from the textual content. As an example, consider a document published this year that briefly describes a documentary celebrating the life of Princess Diana. The textual content of the document does not contain the names Prince Charles or Queen Elizabeth. However, an image in the article shows all three faces (Princess Diana, Prince Charles, and Queen Elizabeth) in the same photo. Including the image in the document representation provides better features in this case because the image supplies additional contextual information that enriches the content. My research focuses on incorporating such contextual information into representations in the absence of annotated faces.
I seek to answer three research questions relevant to document representations: 1) how to extract contextual information in the absence of labeled data and use that context to produce better representations for text snippets, 2) how to construct contextual information for visual objects (e.g., faces) found in the images of a document collection, and 3) how to combine the contextual information extracted from imagery and textual content for document representation. To address the first question, I designed an objective function that uses temporal, geographical, and topical information of documents to generate a multi-graph of relationships between text fragments. I then used a neural-network-based model that leverages the multi-graph to produce high-quality representations for textual entities, including the names of people, locations, and organizations found in the documents. In response to the second question, I developed a probabilistic model that generates probability distributions over person names, locations, and countries for every human face detected in the images of a document collection. My dissertation restricts all analyses to news articles. The visual objects in my dissertation are human faces because they are abundant in news articles and relate directly or contextually to the content. Finally, to answer the third question, I propose a neural language model that exploits the contextual information generated for faces and textual content to represent documents in a compact continuous space.
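To make the idea of a per-face distribution over person names concrete, the following is a minimal sketch and not the dissertation's probabilistic model. It assumes that faces have already been detected and grouped into identifiers such as `face_A` (a hypothetical label), that person names have already been extracted from each article's text, and it substitutes plain normalized co-occurrence counts for the actual model. The function name `name_distributions` and the toy data are illustrative only.

```python
from collections import Counter, defaultdict


def name_distributions(documents):
    """Estimate P(name | face) from document-level co-occurrence.

    documents: iterable of (face_ids, names) pairs, one pair per article.
    Assumption: a face and a name are related if they appear in the same
    article; this is a simplification used only for illustration.
    """
    counts = defaultdict(Counter)
    for face_ids, names in documents:
        for face in set(face_ids):      # count each face once per article
            for name in set(names):     # count each name once per article
                counts[face][name] += 1

    distributions = {}
    for face, name_counts in counts.items():
        total = sum(name_counts.values())
        # Normalize the counts so that each face's values sum to 1.
        distributions[face] = {
            name: n / total for name, n in name_counts.items()
        }
    return distributions


# Toy corpus (invented for illustration): three articles with face IDs and names.
docs = [
    (["face_A"], ["Princess Diana", "Prince Charles"]),
    (["face_A", "face_B"], ["Princess Diana"]),
    (["face_B"], ["Prince Charles", "Queen Elizabeth"]),
]
dists = name_distributions(docs)
for face in sorted(dists):
    print(face, dists[face])
```

In this toy corpus, `face_A` co-occurs with "Princess Diana" in two articles and with "Prince Charles" in one, so the sketch assigns the former twice the probability of the latter. A face can also receive probability mass for a name that never appears in the same article as a caption, which is the kind of unlabeled context the abstract describes.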
I demonstrate the effectiveness of the methods through a set of rigorous experiments and case studies. My experiments show that the document representations generated by my proposed method improve the performance of many machine learning algorithms.
Received from ProQuest
Md Abdul Kader
Kader, Md Abdul, "Contextual Representation Of Documents, Entities, And Faces Of People Using A News Corpus" (2017). Open Access Theses & Dissertations. 471.