Research on Hot Topic Discovery Technology of Micro-blog Based on Biterm Topic Model


Abstract. To overcome the data sparsity and expression diversity of short text and to improve clustering quality, this paper proposes a text feature enhancement method based on the biterm topic model (BTM). First, we apply BTM to the corpus to extract a matrix of high-frequency words for each underlying topic, and then use this matrix to selectively strengthen the traditional vector space model (VSM), which reduces the vector dimension and highlights the main features. We also propose a heat calculation equation that combines the propagation characteristics and the time effect of micro-blogs, so that the evolution of a topic can be better demonstrated and analyzed. Experiments show that our method improves clustering quality and that the heat calculation equation benefits the discovery of hot topics and the analysis of their evolution.

Keywords: Biterm topic model · Feature enhancement · Topic discovery · Hot topic evolution

© Springer Nature Singapore Pte Ltd. 2017. H. Yuan et al. (Eds.): GRMSE 2016, Part II, CCIS 699, pp. 234–244, 2017. DOI: 10.1007/978-981-10-3969-0_27

1 Introduction

Hot topic discovery uses clustering analysis or topic extraction to dig out, from a large amount of information, the meaningful content that users pay attention to. It belongs to Topic Detection and Tracking (TDT) [1] and can be applied across all domains or within a specific domain. For a hot topic, it can comprehensively reveal people's attitudes and the subsequent development of popular events. More importantly, hot topic discovery can find emerging hot topics that have not yet received many reports.

As one of the most important forms of micro media, micro-blog has features such as wide information coverage, real-time updates, high interactivity and simple metadata. However, such short text suffers from a severe data sparsity problem, and its colloquial and diverse expression also hinders feature selection.

To address these problems, this paper focuses on data sparsity and expression diversity: it applies the biterm topic model (BTM) [2] to topic extraction from micro-blogs and strengthens the VSM with the topic-word matrix. This reduces the vector dimension while preserving more of the original information. We also merge words that potentially express the same topic to address the diversity problem. In addition, we improve the K-means algorithm with propagation characteristics and the time effect: the K value is adaptive and the clustering is incremental. A heat calculation equation is proposed to describe how hot an event is and how it evolves.
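As background for the BTM-based approach described above, the following is a minimal sketch of the unit that BTM models: a biterm, i.e. an unordered pair of words that co-occur in the same short text. The sketch assumes each micro-blog has already been tokenized and treats the whole post as a single context window. The function names `extract_biterms` and `corpus_biterm_counts` and the sample tokens are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one tokenized short text.

    Assumption: the whole text is one context window, which is a common
    choice for very short documents such as micro-blog posts.
    """
    # Sort each pair so that (a, b) and (b, a) are counted as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over a corpus of tokenized short texts."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical, already-tokenized posts used only for illustration.
docs = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "donation"],
]
counts = corpus_biterm_counts(docs)
```

Because the counts are aggregated over the whole corpus rather than per document, word co-occurrence statistics stay usable even when each individual post contains only a few words, which is the reason BTM is suited to sparse short text.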

2 Related Works

2.1 Research on Short Text Clustering

As a typical form of short text, micro-blog suffers from a severe data sparsity problem. At present, improvements to short text clustering are mainly based on feature selection. The FTC algorithm [3], proposed by Beil et al. in 2002, holds that certain specific words appear in documents that share the same category. This means t
