Improving biterm topic model with word embeddings

Jiajia Huang1 · Min Peng2 · Pengwei Li1 · Zhiwei Hu3 · Chao Xu1

Received: 12 September 2019 / Revised: 29 April 2020 / Accepted: 4 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation, and other text analysis tasks. Conventional topic models mainly rely on word co-occurrence information in texts for topic inference. However, when applied to short texts, these models often fail to extract groups of words that are semantically coherent and have adequate representation ability, because the feature space of short texts is too sparse to provide enough co-occurrence information for topic inference. The continuous development of word embeddings brings a new representation of words and a more effective measurement of word semantic similarity from a conceptual perspective. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The results show that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge of each biterm's frequency and semantic similarity. NBTMWE shows the following advantages over BTM: (1) it can distinguish meaningful latent topics from a noise topic consisting of commonly used words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during the sampling process via the generalized Pólya Urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report results on two short text datasets (Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of topic coherence, topic word similarity, and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.

Keywords Topic model · Word embeddings · Short texts · Noise biterm · BTM

Corresponding author: Pengwei Li

[email protected]

Extended author information available on the last page of the article.
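As a concrete illustration of the preprocessing the abstract describes, the sketch below mines biterms (unordered word pairs) from short texts, counts their frequency, and scores each biterm by the cosine similarity of its two words' embedding vectors. This is a minimal sketch, not the authors' implementation: the toy corpus, the three-dimensional embedding vectors, and the helper names extract_biterms and cosine are all illustrative placeholders, whereas the paper itself uses word embeddings pretrained on a large external corpus.

    # Minimal sketch: biterm mining plus frequency and embedding-similarity
    # scoring. Toy corpus and embeddings are placeholders, not the paper's data.
    from collections import Counter
    from itertools import combinations
    import math

    def extract_biterms(docs):
        """Collect every unordered word pair (biterm) within each short text."""
        counts = Counter()
        for doc in docs:
            words = sorted(set(doc.split()))       # unique words, canonical order
            for w1, w2 in combinations(words, 2):  # all unordered pairs
                counts[(w1, w2)] += 1
        return counts

    def cosine(u, v):
        """Cosine similarity between two embedding vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Toy data: two short texts and made-up 3-dimensional embeddings.
    docs = ["apple fruit juice", "apple phone screen"]
    emb = {
        "apple":  [0.9, 0.1, 0.3], "fruit": [0.8, 0.2, 0.1],
        "juice":  [0.7, 0.1, 0.2], "phone": [0.1, 0.9, 0.4],
        "screen": [0.2, 0.8, 0.5],
    }

    for (w1, w2), freq in extract_biterms(docs).items():
        sim = cosine(emb[w1], emb[w2])
        print(f"({w1}, {w2})  freq={freq}  sim={sim:.2f}")

In the model itself, these per-biterm frequency and similarity statistics serve as the prior knowledge that helps route low-frequency, low-similarity biterms toward the noise topic rather than the meaningful latent topics.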


1 Introduction

With the development of social media, short texts have become popular information carriers on the Internet. Such texts include tweets, questions in Q&A communities, labels of images or videos, news titles and comments, and so on. Discovering the knowledge hidden in large collections of short texts has become a challenging and promising research issue, embodied in various tasks such as topic extraction [8, 37, 38], emerging event detection [12, 26], comment summarization [23, 34], and conversation generation