A Pseudo-document-based Topical N-grams model for short texts
Hao Lin¹ · Yuan Zuo¹ · Guannan Liu¹ · Hong Li¹ · Junjie Wu¹,²,³ · Zhiang Wu⁴

Received: 6 July 2019 / Revised: 19 January 2020 / Accepted: 30 March 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the semantic sparseness of short texts. Most (if not all) of them follow the bag-of-words assumption, which is inadequate because word order and phrases are often critical to capturing the meaning of texts. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts due to severe data sparseness. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets with state-of-the-art baselines demonstrate the high quality of the topics learned by PTNG according to UCI coherence scores, and a more discriminative semantic representation of short texts according to classification results.

Keywords Short text · Topic model · Word order · Topical N-Grams
1 Introduction
Short text has become a prevalent format of information on the Internet, due to the explosive growth of online social media such as Twitter and Facebook. On Twitter, for example, almost 500 million tweets are produced daily by around 250 million active users. These massive short texts carry sophisticated information that can hardly be found in conventional sources of information [34]. Accurate knowledge discovery from short texts has therefore been recognized as a challenging yet promising research problem. The archetypal topic model, i.e., Latent Dirichlet Allocation (LDA) [1], performs relatively poorly when directly applied to short texts, because they lack word co-occurrence information [24] compared to normal-size documents. Therefore, many research efforts have
World Wide Web
been devoted to tackling the incompetence of LDA in modeling short texts. Several customized topic models [5, 10, 21, 31, 32, 35, 36] have been proposed to alleviate the data sparsity issue of short texts. One potential limitation of the above models is that they all follow the bag-of-words assumption, which brings computational efficiency but might severely hurt the accuracy of topic modeling by ignoring word order. We list two detailed reasons as follows:

– Sentences that have the same bag-of-words representation could have quite different meanings. For instance, “the department chair couches offers” and “the chair department offers couches” are about quite different topics while having the same unigram statistics [25].
– Different from normal-size documents, many short texts
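The first reason in the list above can be checked directly. The following is a minimal sketch, not part of PTNG itself: it compares unigram and bigram counts for the department-chair sentence pair. The helper name `ngram_counts` is hypothetical, tokenization is plain whitespace splitting, and the two sentences are written with identical word forms ("couches", "offers") so that their unigram counts match exactly; that spelling is an assumption.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count contiguous word n-grams in a whitespace-tokenized string.

    Hypothetical helper for illustration only.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Sentence pair with identical word forms (assumed spelling).
sentence_a = "the department chair couches offers"
sentence_b = "the chair department offers couches"

# Bag-of-words view: compare unigram counts only.
same_unigrams = ngram_counts(sentence_a, 1) == ngram_counts(sentence_b, 1)

# Order-sensitive view: bigrams that occur in both sentences.
shared_bigrams = ngram_counts(sentence_a, 2) & ngram_counts(sentence_b, 2)

print("identical unigram counts:", same_unigrams)
print("number of shared bigrams:", len(shared_bigrams))
```

Under these assumptions the two sentences are indistinguishable to a bag-of-words model, yet they share no bigram, which is the kind of signal a word-order-sensitive model can exploit.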