ORIGINAL ARTICLE
Topic discovery by spectral decomposition and clustering with coordinated global and local contexts

Jian Wang1 · Kejing He1 · Min Yang2

* Kejing He, [email protected] · Jian Wang, [email protected] · Min Yang, [email protected]

1 School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Received: 29 June 2019 / Accepted: 15 April 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract

Topic modeling is an active research field due to its broad applications, such as information retrieval, opinion extraction and authorship identification. It aims to discover topic structures from a collection of documents. Significant progress has been made by latent Dirichlet allocation (LDA) and its variants. However, conventional methods usually make the "bag-of-words" assumption for the whole document, ignoring the semantics of the local context, which plays a crucial role in topic modeling and document understanding. In this paper, we propose a novel coordinated embedding topic model (CETM), which combines spectral decomposition and a clustering technique, leveraging both global and local context information to discover topics. In particular, CETM learns coordinated embeddings via spectral decomposition, effectively capturing semantic relations between words. To infer the topic distribution, we employ a clustering algorithm to capture the semantic centroids of the coordinated embeddings and derive a fast algorithm to obtain the topic structures. We conduct extensive experiments on three real-world datasets to evaluate the effectiveness of CETM. Quantitatively, compared to state-of-the-art topic modeling approaches, CETM achieves significantly better performance in terms of topic coherence and text classification. Qualitatively, CETM is able to learn more coherent topics and more accurate word distributions for each topic.

Keywords Topic modeling · Spectral decomposition · Clustering · Global context · Local context
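To make the pipeline concrete, the following is a minimal Python sketch of the spectral-decomposition-plus-clustering idea the abstract describes. It is not the authors' CETM implementation: the toy corpus, the embedding dimension k, and the choice of a truncated SVD over a word-document count matrix followed by k-means are illustrative assumptions. In particular, this sketch uses only global bag-of-words statistics, whereas CETM also coordinates local context information.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration; the paper evaluates CETM on
# three real-world datasets).
docs = [
    "topic models discover latent structure in document collections",
    "spectral decomposition of a count matrix yields word embeddings",
    "clustering word embeddings groups semantically related words",
]

# Global context: a word-by-document count matrix (bag-of-words statistics).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).T.toarray()  # shape: (vocab_size, n_docs)
vocab = vectorizer.get_feature_names_out()

# Spectral decomposition: a truncated SVD of the count matrix gives each word
# a low-dimensional embedding. The dimension k = 2 is an assumed toy value.
k = 2
U, S, _ = np.linalg.svd(X, full_matrices=False)
embeddings = U[:, :k] * S[:k]

# Clustering: k-means centroids over the word embeddings act as semantic
# centroids; each word belongs to the topic whose centroid is nearest.
n_topics = 2  # assumed for the toy corpus
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(embeddings)
for t in range(n_topics):
    print(f"topic {t}:", ", ".join(vocab[km.labels_ == t]))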
1 Introduction

With the growth of large collections of electronic texts, much attention has been given to topic modeling of textual corpora, which is designed to identify representations of data and learn thematic structure from large document collections without human supervision. Conventional topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [15] and Latent Dirichlet Allocation (LDA) [4], can be viewed as graphical models with latent variables. Some non-parametric extensions of LDA have been successfully applied to characterize the contents of documents [31, 33]. However, inference in these non-parametric models is computationally hard, so one must resort to inaccurate or slow approximations to calculate the posterior distributions over the topics. Newer undirected graphical model approaches, including the replicated softmax model [14], have also been successfully used to explore the topics of documents.
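For reference, the kind of LDA baseline referred to above can be fitted in a few lines with scikit-learn; the toy corpus and the number of topics below are assumptions chosen purely for illustration.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and topic count are assumptions for illustration only.
docs = [
    "latent variables capture the thematic structure of documents",
    "posterior inference over topics can be slow to approximate",
    "undirected models such as the replicated softmax also model topics",
]

# Bag-of-words counts: the standard input representation for LDA.
counts = CountVectorizer().fit_transform(docs)

# Fit LDA; fit_transform returns the per-document topic mixture weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))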