Improving biterm topic model with word embeddings

Jiajia Huang1 · Min Peng2 · Pengwei Li1 · Zhiwei Hu3 · Chao Xu1

Received: 12 September 2019 / Revised: 29 April 2020 / Accepted: 4 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation, and other text analysis tasks. Conventional topic models mainly rely on word co-occurrence information in texts for topic inference. However, when applied to short texts, these models often fail to extract groups of words that are semantically coherent and have adequate representation ability, because the feature space of short texts is too sparse to provide enough co-occurrence information for topic inference. The continuous development of word embeddings brings a new representation of words and a more effective measurement of word semantic similarity from a conceptual perspective. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The results show that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge of each biterm's frequency and semantic similarity. NBTMWE shows the following advantages over BTM: (1) it can distinguish meaningful latent topics from a noise topic consisting of commonly used words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during the sampling process via the generalized Pólya Urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report results on two short text datasets (Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of topic coherence, topic word similarity, and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.

Keywords Topic model · Word embeddings · Short texts · Noise biterm · BTM

Corresponding author: Pengwei Li

[email protected]

Extended author information available on the last page of the article.
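As a concrete illustration of the preprocessing the abstract describes, the sketch below mines biterms (unordered word pairs) from short texts, counts their frequency, and scores each biterm by the cosine similarity of its two words' embedding vectors. This is a minimal sketch, not the authors' implementation: the toy corpus, the three-dimensional embedding vectors, and the helper names extract_biterms and cosine are all illustrative placeholders, whereas the paper itself uses word embeddings pretrained on a large external corpus.

    # Minimal sketch: biterm mining plus frequency and embedding-similarity
    # scoring. Toy corpus and embeddings are placeholders, not the paper's data.
    from collections import Counter
    from itertools import combinations
    import math

    def extract_biterms(docs):
        """Collect every unordered word pair (biterm) within each short text."""
        counts = Counter()
        for doc in docs:
            words = sorted(set(doc.split()))       # unique words, canonical order
            for w1, w2 in combinations(words, 2):  # all unordered pairs
                counts[(w1, w2)] += 1
        return counts

    def cosine(u, v):
        """Cosine similarity between two embedding vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Toy data: two short texts and made-up 3-dimensional embeddings.
    docs = ["apple fruit juice", "apple phone screen"]
    emb = {
        "apple":  [0.9, 0.1, 0.3], "fruit": [0.8, 0.2, 0.1],
        "juice":  [0.7, 0.1, 0.2], "phone": [0.1, 0.9, 0.4],
        "screen": [0.2, 0.8, 0.5],
    }

    for (w1, w2), freq in extract_biterms(docs).items():
        sim = cosine(emb[w1], emb[w2])
        print(f"({w1}, {w2})  freq={freq}  sim={sim:.2f}")

In the model itself, these per-biterm frequency and similarity statistics serve as the prior knowledge that helps route low-frequency, low-similarity biterms toward the noise topic rather than the meaningful latent topics.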


1 Introduction

With the development of social media, short texts have become popular information carriers on the Internet. Such texts include tweets, questions in Q&A communities, labels of images or videos, news titles and comments, and so on. Discovering the knowledge hidden in large collections of short texts has become a challenging and promising research issue, embodied in various tasks such as topic extraction [8, 37, 38], emerging event detection [12, 26], comment summarization [23, 34], and conversation generation