Bag of biterms modeling for short texts



Anh Phan Tuan1 · Bach Tran1 · Thien Huu Nguyen2 · Linh Ngo Van1 · Khoat Than1

Received: 20 September 2018 / Revised: 8 June 2020 / Accepted: 13 June 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Analyzing texts from social media poses many challenges because such texts are short, massive in volume, and dynamic. Short texts do not provide enough context information, causing traditional statistical models to fail. Furthermore, many applications face massive and dynamic short-text collections, which pose various computational challenges for current batch learning algorithms. This paper presents a novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), BBM offers two major advantages: (1) it increases the effective length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via the bag of biterms, and (2) it inherits inference and learning algorithms from the base models, which makes it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.

Keywords Short texts · Document representation · Topic modeling · Short text classification
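To make the idea of a biterm concrete, the following is a minimal sketch of how a short document could be turned into a bag of biterms. It assumes that a biterm is an unordered pair of word occurrences from the same document, which is the usual reading of "pairs of words"; the exact construction used by BBM is not specified in this section, and the function name `bag_of_biterms` and the example document are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bag_of_biterms(tokens):
    """Count unordered word pairs that co-occur in one short document.

    Assumption: every pair of token positions in the document forms a
    biterm; the paper's own construction may differ (e.g. in windowing).
    """
    # Sort each pair so that ("b", "a") and ("a", "b") are the same biterm.
    pairs = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(pairs)


doc = "apple iphone release date".split()
bow = Counter(doc)           # bag of words: one entry per token
bob = bag_of_biterms(doc)    # bag of biterms: one entry per token pair

print(sum(bow.values()), sum(bob.values()))
```

Under this assumption a document with n tokens yields n(n-1)/2 biterms, which illustrates how the representation lengthens a short document while recording word co-occurrence.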

Anh Phan Tuan and Bach Tran have contributed equally to this work as first authors. This paper is an extended version of our PAKDD2016 paper [33].

Linh Ngo Van [email protected]

Extended author information available on the last page of the article

1 Introduction

In recent years, short texts have emerged as a dominant source of text data, arising in major web activities such as search queries, tweets, tags, messages, and social network posts. It is therefore crucial to be able to automatically analyze such large amounts of short text and gain valuable knowledge from them. Conventional topic modeling techniques such as pLSA [1], LDA [2], and HDP [3] are natural candidates for such analysis, as they have proven successful for text analysis on the usual long documents. Unfortunately, applying those topic modeling techniques directly causes various issues for short texts due to their unique characteristics of being short, informal, massive, and dynamic. A typical issue concerns the shortness of the texts. In p