Research on Hot Topic Discovery Technology of Micro-blog Based on Biterm Topic Model
Abstract. To overcome the data sparsity and expression diversity of short texts and to improve clustering quality, this paper proposes a text feature enhancement method based on the biterm topic model (BTM). First, we use BTM to extract the underlying topics of the corpus and obtain their high-frequency word matrix. We then use this matrix to selectively strengthen the traditional vector space model (VSM), which reduces the vector dimension and highlights the main features. We also propose a heat calculation equation that combines the propagation characteristics and the time effect of micro-blogs, so that the evolution of a topic can be demonstrated and analyzed more clearly. Experiments show that our method improves clustering quality and that the heat calculation equation helps in discovering hot topics and tracking their evolution.
Keywords: Biterm topic model · Feature enhancement · Topic discovery · Hot topic evolution
© Springer Nature Singapore Pte Ltd. 2017. H. Yuan et al. (Eds.): GRMSE 2016, Part II, CCIS 699, pp. 234–244, 2017. DOI: 10.1007/978-981-10-3969-0_27
1 Introduction
Hot topic discovery uses clustering analysis or topic extraction to dig out, from a large amount of information, the meaningful content that users pay attention to. It belongs to Topic Detection and Tracking (TDT) [1] and can be applied across all domains or within a specific one. For a given hot topic, it can comprehensively capture people's attitudes and the subsequent development of popular events. More importantly, hot topic discovery can find emerging hot topics that have not yet been widely reported.
As one of the most important forms of micro media, micro-blog offers wide information coverage, real-time updates, high interactivity and simple metadata. However, such short text suffers from a severe data sparsity problem, and its colloquial and diverse expression makes feature selection difficult.
To solve these problems, this paper addresses data sparsity and expression diversity by applying the biterm topic model (BTM) [2] to topic extraction from micro-blogs and strengthening the VSM with the topic-word matrix. This reduces the vector dimension while preserving more of the original information. We also merge words that potentially express the same topic to address the diversity problem. In addition, we improve the K-means algorithm with the propagation characteristics and time effect of micro-blogs: the K value is adaptive, the clustering is incremental, and a heat calculation equation is proposed to describe how hot an event is and how it evolves.
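As a rough illustration of the two ideas introduced above, the following Python sketch extracts biterms (unordered word pairs that co-occur in one short text, the unit that BTM models) and builds a term-frequency vector over a reduced vocabulary. It is a minimal sketch under stated assumptions, not the method of this paper: the function names `extract_biterms` and `restricted_vector`, the toy posts, and the hand-picked `topic_words` list (a stand-in for the high-frequency topic words that BTM would produce) are all hypothetical, and the selective strengthening of the VSM described here may differ in detail.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    # Sort each pair so that (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def restricted_vector(tokens, vocabulary):
    """Term-frequency vector over a reduced vocabulary (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Toy, already-tokenized micro-blog posts (illustrative data only).
posts = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "relief", "donation"],
]

# Corpus-level biterm counts: the observations a BTM would be fitted to.
biterm_counts = Counter(b for post in posts for b in extract_biterms(post))

# Assumed stand-in for the high-frequency topic words returned by BTM.
topic_words = ["earthquake", "rescue", "relief"]

# Each post is represented only over the topic words, which shrinks the
# vector dimension compared with a full-vocabulary VSM.
vectors = [restricted_vector(post, topic_words) for post in posts]
```

Because biterms are counted over the whole corpus rather than per document, the word co-occurrence statistics stay usable even when each individual post contains only a few words.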
2 Related Works
2.1 Research on Short Text Clustering
As a typical short text, micro-blog suffers from a severe data sparsity problem. At present, improvements to short text clustering are mainly based on feature selection. In 2002, the FTC algorithm [3] proposed by Beil et al. held that certain specific words appear in documents that share the same category. This means t