A term correlation based semi-supervised microblog clustering with dual constraints


ORIGINAL ARTICLE

Huifang Ma1,2 · Di Zhang1 · Meihuizi Jia1 · Xianghong Lin1

Received: 10 September 2015 / Accepted: 15 November 2017
© Springer-Verlag GmbH Germany, part of Springer Nature 2017

Abstract Microblog clustering is important in many web applications. However, microblogs are short and therefore provide too few word occurrences for traditional text clustering approaches to reach their full potential. To address this problem, we propose a novel semi-supervised learning scheme that exploits semantic information to compensate for the limited message length. The key idea is to use term correlation data, which captures semantic information for term weighting and provides greater context for microblogs. We then formulate the microblog clustering problem as a semi-supervised non-negative matrix factorization co-clustering framework that takes advantage of both prior domain knowledge of data points (microblogs), in the form of pairwise constraints, and category knowledge of features (terms). Our approach not only greatly reduces the labor-intensive labeling process but also exploits hidden information in the microblogs themselves. Extensive experiments are conducted on two real-world microblog datasets. The results demonstrate the effectiveness of the proposed approach, which performs well compared with state-of-the-art methods.

Keywords Semi-supervised clustering · Microblogs · Dual constraints · Term correlation matrix · Non-negative matrix factorization

* Huifang Ma [email protected]

1 College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China

2 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100085, China

1 Introduction

Online social networks have become increasingly popular in recent years. Microblog platforms such as Sina or Twitter have become important real-time information resources for disseminating breaking news, sharing information, and participating in events [1]. Users can take advantage of these services to express their ideas and intentions in short textual snippets on a daily or even hourly basis. In practice, users often have to browse through large amounts of information to locate what they are interested in. Obtaining a meaningful cluster hierarchy for a microblog corpus can therefore be a major way of organizing these short, ambiguous and even vague microblogs. Approaches to solve this problem have mainly focused on clustering algorithms [2–4]. As with documents, the traditional representation of a microblog is an n-dimensional vector, where n is the number of terms appearing in the dictionary, and each vector component reflects the importance of the corresponding term with respect to the semantics of the microblog. Microblog corpus can
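The n-dimensional vector representation described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and TF-IDF as the term-importance weight; both are illustrative choices, not necessarily the weighting used by the authors, and the function name `tfidf_vectors` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Map each microblog to an n-dimensional vector, n = dictionary size.

    Assumption: TF-IDF stands in for "importance of the term"; the paper's
    own term weighting may differ.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    # The dictionary: every distinct term in the corpus, in a fixed order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many microblogs each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length = len(tokens) or 1  # guard against empty microblogs
        vectors.append(
            [(tf[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical toy corpus of three short messages.
corpus = ["rain in lanzhou today", "heavy rain today", "new phone released"]
vocab, vectors = tfidf_vectors(corpus)
nonzero = sum(1 for w in vectors[0] if w > 0)
print(f"{len(vocab)} dimensions, {nonzero} non-zero in the first microblog")
```

Even in this toy corpus most components of each vector are zero, which reflects the sparsity problem that short messages cause for clustering.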
