ORIGINAL ARTICLE
Topic discovery by spectral decomposition and clustering with coordinated global and local contexts

Jian Wang1 · Kejing He1 · Min Yang2

* Kejing He, [email protected] · Jian Wang, [email protected] · Min Yang, [email protected]

1 School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Received: 29 June 2019 / Accepted: 15 April 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract

Topic modeling is an active research field due to its broad applications, such as information retrieval, opinion extraction and authorship identification. It aims to discover topic structures from a collection of documents. Significant progress has been made by latent Dirichlet allocation (LDA) and its variants. However, conventional methods usually make the "bag-of-words" assumption for the whole document, ignoring the semantics of the local context, which plays a crucial role in topic modeling and document understanding. In this paper, we propose a novel coordinated embedding topic model (CETM), which combines spectral decomposition and a clustering technique, leveraging both global and local context information to discover topics. In particular, CETM learns coordinated embeddings via spectral decomposition, effectively capturing semantic relations between words. To infer the topic distribution, we employ a clustering algorithm to capture the semantic centroids of the coordinated embeddings and derive a fast algorithm to obtain the topic structures. We conduct extensive experiments on three real-world datasets to evaluate the effectiveness of CETM. Quantitatively, compared to state-of-the-art topic modeling approaches, CETM achieves significantly better performance in terms of topic coherence and text classification. Qualitatively, CETM is able to learn more coherent topics and more accurate word distributions for each topic.

Keywords Topic modeling · Spectral decomposition · Clustering · Global context · Local context
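To make the pipeline concrete, the following is a minimal Python sketch of the spectral-decomposition-plus-clustering idea the abstract describes. It is not the authors' CETM implementation: the toy corpus, the embedding dimension k, and the choice of a truncated SVD over a word-document count matrix followed by k-means are illustrative assumptions. In particular, this sketch uses only global bag-of-words statistics, whereas CETM also coordinates local context information.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration; the paper evaluates CETM on
# three real-world datasets).
docs = [
    "topic models discover latent structure in document collections",
    "spectral decomposition of a count matrix yields word embeddings",
    "clustering word embeddings groups semantically related words",
]

# Global context: a word-by-document count matrix (bag-of-words statistics).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).T.toarray()  # shape: (vocab_size, n_docs)
vocab = vectorizer.get_feature_names_out()

# Spectral decomposition: a truncated SVD of the count matrix gives each word
# a low-dimensional embedding. The dimension k = 2 is an assumed toy value.
k = 2
U, S, _ = np.linalg.svd(X, full_matrices=False)
embeddings = U[:, :k] * S[:k]

# Clustering: k-means centroids over the word embeddings act as semantic
# centroids; each word belongs to the topic whose centroid is nearest.
n_topics = 2  # assumed for the toy corpus
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(embeddings)
for t in range(n_topics):
    print(f"topic {t}:", ", ".join(vocab[km.labels_ == t]))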
1 Introduction

With the growth of large collections of electronic texts, much attention has been given to topic modeling of textual corpora, which is designed to identify representations of data and learn thematic structure from large document collections without human supervision. Conventional topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [15] and Latent Dirichlet Allocation (LDA) [4], can be viewed as graphical models with latent variables. Some non-parametric extensions of LDA have been successfully applied to characterize the contents of documents [31, 33]. However, inference in these non-parametric models is computationally hard, so one must resort to inaccurate or slow approximations to calculate the posterior distributions over the topics. Newer undirected graphical model approaches, including the replicated softmax model [14], have also been successfully used to explore the topics of documents.
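For reference, the kind of LDA baseline referred to above can be fitted in a few lines with scikit-learn; the toy corpus and the number of topics below are assumptions chosen purely for illustration.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and topic count are assumptions for illustration only.
docs = [
    "latent variables capture the thematic structure of documents",
    "posterior inference over topics can be slow to approximate",
    "undirected models such as the replicated softmax also model topics",
]

# Bag-of-words counts: the standard input representation for LDA.
counts = CountVectorizer().fit_transform(docs)

# Fit LDA; fit_transform returns the per-document topic mixture weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))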