A Pseudo-document-based Topical N-grams model for short texts
Hao Lin1 · Yuan Zuo1 · Guannan Liu1 · Hong Li1 · Junjie Wu1,2,3 · Zhiang Wu4

Received: 6 July 2019 / Revised: 19 January 2020 / Accepted: 30 March 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the inherent semantic sparseness of short texts. Most (if not all) of them follow the bag-of-words assumption, which, however, is inadequate, since word order and phrases are often critical to capturing the meaning of texts. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts due to severe data sparseness. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets against state-of-the-art baselines demonstrate the high quality of topics learned by PTNG, as measured by UCI coherence scores, and the more discriminative semantic representations of short texts it produces, as measured by classification results.

Keywords Short text · Topic model · Word order · Topical N-Grams
1 Introduction

Short text has become the prevalent format of information on the Internet, owing to the explosive growth of online social media such as Twitter and Facebook. On Twitter, for example, around 250 million active users produce almost 500 million tweets daily. These massive short texts carry sophisticated information that can hardly be found in conventional information sources [34]. Accurate knowledge discovery from short texts has therefore been recognized as a challenging yet promising research problem.

The archetypal topic model, i.e., Latent Dirichlet Allocation (LDA) [1], performs relatively poorly when directly applied to short texts, owing to the lack of word co-occurrence information [24] compared with normal-sized documents. Therefore, many research efforts have
been devoted to tackling the incompetence of LDA in modeling short texts. Several customized topic models [5, 10, 21, 31, 32, 35, 36] have been proposed to alleviate the data sparsity issue of short texts. One potential limitation of the above models is that they all follow the bag-of-words assumption, which brings computational efficiency but may severely hurt the accuracy of topic modeling through its ignorance of word order. We list two detailed reasons as follows:

– Sentences with the same bag-of-words representation can have quite different meanings. For instance, "the department chair couches offers" and "the chair department offers couches" are about quite different topics while sharing the same unigram statistics [25] (see the sketch after this list).
– Different from normal-sized documents, many short texts
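To make the first point above concrete, here is a minimal Python sketch (using the two example phrases attributed to [25]) showing that the sentences are indistinguishable under the bag-of-words assumption, whereas even a simple bigram representation separates them:

    from collections import Counter

    # Two sentences with identical unigram statistics [25]
    s1 = "the department chair couches offers"
    s2 = "the chair department offers couches"

    # Bag-of-words representation: word -> count
    bow1 = Counter(s1.split())
    bow2 = Counter(s2.split())
    print(bow1 == bow2)   # True: bag-of-words cannot tell them apart

    # Bigram representation: (word_i, word_{i+1}) -> count
    big1 = Counter(zip(s1.split(), s1.split()[1:]))
    big2 = Counter(zip(s2.split(), s2.split()[1:]))
    print(big1 == big2)   # False: word order distinguishes them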