A term correlation based semi-supervised microblog clustering with dual constraints
ORIGINAL ARTICLE
Huifang Ma1,2 · Di Zhang1 · Meihuizi Jia1 · Xianghong Lin1
Received: 10 September 2015 / Accepted: 15 November 2017
© Springer-Verlag GmbH Germany, part of Springer Nature 2017
Abstract Microblog clustering is important in many web applications. However, microblogs provide few word occurrences, and their limited length prevents traditional text clustering approaches from being employed to their full potential. To address this problem, we propose a novel semi-supervised learning scheme that exploits semantic information to compensate for the limited message length. The key idea is to exploit term correlation data, which captures semantic information for term weighting and provides greater context for microblogs. We then formulate the microblog clustering problem as a semi-supervised non-negative matrix factorization co-clustering framework, which takes advantage of prior domain knowledge both of data points (microblogs), in the form of pair-wise constraints, and of features (terms), in the form of category knowledge. Our approach not only greatly reduces the labor-intensive labeling process, but also exploits hidden information in the microblogs themselves. Extensive experiments are conducted on two real-world microblog datasets. The results demonstrate the effectiveness of the proposed approach, which performs promisingly compared with state-of-the-art methods.
Keywords Semi-supervised clustering · Microblogs · Dual constraints · Term correlation matrix · Nonnegative matrix factorization
* Huifang Ma [email protected]
1 College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China
2 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100085, China

1 Introduction
Online social networks have become increasingly popular in recent years. Microblog platforms such as Sina or Twitter have become important real-time information resources for disseminating breaking news, sharing information, and participating in events [1]. Users can take advantage of these services to express their ideas and intentions in short textual snippets on a daily or even hourly basis. In practice, users often have to browse through a large amount of information to locate the things they are interested in. Obtaining a meaningful cluster hierarchy for a microblog corpus can therefore be a major way of organizing these short, ambiguous, and even vague microblogs. Approaches to solving this problem have mainly focused on clustering algorithms [2–4]. As with documents, the traditional representation of a microblog is an n-dimensional vector, where n is the number of terms appearing in the dictionary, and each vector component reflects the importance of the corresponding term with respect to the semantics of the microblog. Microblog corpus can
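
The n-dimensional vector representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the use of raw term counts as the "importance" of a term, the toy corpus, and the helper names `build_vocabulary` and `to_term_vector` are all assumptions made for illustration. The paper itself weights terms using term correlation, which this sketch does not attempt.

```python
from collections import Counter


def build_vocabulary(microblogs):
    """Return a sorted list of the distinct terms across all microblogs."""
    terms = set()
    for text in microblogs:
        # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
        terms.update(text.lower().split())
    return sorted(terms)


def to_term_vector(text, vocabulary):
    """Map one microblog to an n-dimensional vector, n = len(vocabulary)."""
    counts = Counter(text.lower().split())
    # Assumption: raw term counts are used as a stand-in for term importance.
    return [counts[term] for term in vocabulary]


# Hypothetical toy corpus of three short messages.
corpus = [
    "earthquake hits city",
    "city marathon starts today",
    "earthquake relief for city",
]

vocabulary = build_vocabulary(corpus)
vectors = [to_term_vector(text, vocabulary) for text in corpus]

for text, vector in zip(corpus, vectors):
    print(vector, text)
```

Even in this toy corpus most vector components are zero, which reflects the sparsity caused by short message length that the abstract identifies as the core difficulty.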