Document Graph Representation Learning

Text corpora constitute an important class of real-world data, covering Web pages, news articles, academic papers, user profiles, and more. To make better sense of the meaning within these text documents, researchers have developed neural topic models, in which each document is represented by a low-dimensional, interpretable topic distribution. Beyond the textual content within documents, the documents themselves are usually connected in a network structure. For example, Google Web pages contain hyperlinks to other related pages, academic papers cite other papers, Facebook user profiles are connected as a social network, and news articles with similar tags are linked together. We call such data a document graph.

In this thesis, Ce Zhang focuses on developing neural models with two main objectives. First, to incorporate both the textual content within documents and the connectivity across documents, he aims to design graph representation learning models for document graphs that integrate both modalities into a unified topic distribution. Second, to offer semantic interpretability of the learned topics, he researches neural topic modeling and brings it to document graphs to improve topic quality. The two objectives benefit each other. Connected documents tend to share similar latent topics (e.g., a paper and the papers it cites often discuss similar research), so modeling graph connectivity can capture document similarities and help discover more interpretable topics. Achieving both objectives can better support real-world applications such as Web page search, news article classification, academic paper indexing, and friend recommendation based on user profiles.
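
To make the intuition that connected documents share similar topics concrete, the following is a minimal sketch and not the thesis's method. It assumes a toy document graph with made-up document names, made-up two-topic distributions, and a hypothetical mixing weight `alpha`. Each document's topic distribution is blended with the average distribution of its linked neighbors, a simple neighbor-averaging step used here purely for illustration.

```python
# Hypothetical toy document graph: topic proportions (made-up numbers)
# stand in for what a topic model might infer from each document's text.
topics = {
    "paper_a": [0.8, 0.2],
    "paper_b": [0.6, 0.4],
    "paper_c": [0.1, 0.9],
}

# Undirected links between documents (e.g., citations), also made up.
links = {
    "paper_a": ["paper_b"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": ["paper_b"],
}


def smooth_topics(topics, links, alpha=0.5):
    """Blend each document's topics with the mean of its neighbors' topics.

    alpha is an assumed mixing weight: 1.0 keeps only the document's own
    distribution, 0.0 uses only the neighbor average.
    """
    smoothed = {}
    for doc, own in topics.items():
        neighbors = links.get(doc, [])
        if not neighbors:
            # An isolated document keeps its text-only distribution.
            smoothed[doc] = list(own)
            continue
        num_topics = len(own)
        neighbor_mean = [
            sum(topics[n][i] for n in neighbors) / len(neighbors)
            for i in range(num_topics)
        ]
        smoothed[doc] = [
            alpha * own[i] + (1 - alpha) * neighbor_mean[i]
            for i in range(num_topics)
        ]
    return smoothed


smoothed = smooth_topics(topics, links)
for doc, dist in smoothed.items():
    print(doc, [round(p, 3) for p in dist])
```

Because every input is a probability distribution and the blend is a convex combination, each smoothed vector still sums to one. Linked documents end up with more similar topic proportions than the text alone would give them.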
