A Novel Graph Partitioning Criterion Based Short Text Clustering Method

A novel clustering method based on spectral clustering theory and spectral cut standard is proposed via analyzing the characteristics of short text and the defects of the existing clustering algorithms. First of all, a weighted undirected graph is created

  • PDF / 729,875 Bytes
  • 11 Pages / 439.37 x 666.142 pts Page_size
  • 89 Downloads / 237 Views

DOWNLOAD

REPORT


Abstract. A novel clustering method based on spectral clustering theory and spectral cut standard is proposed via analyzing the characteristics of short text and the defects of the existing clustering algorithms. First of all, a weighted undirected graph is created according to spectral clustering theory, similarity between node and node is calculated on graph, and a symmetrical documents similarity matrix is constructed, which provides all information for the clustering algorithm. Inspired by Greedy strategy, we utilize prim to develop PrimMAE algorithm for the purpose of partitioning graph into two parts, in which RMcut is termination condition of partitioning process, and then it is fed into CASC algorithm to cut the documents set iteratively. Ultimately, high quality clustering results demonstrate the effectiveness of the new clustering algorithm. Keywords: Short text  Similarity matrix  RMcut criterion  Prim algorithm Clustering algorithm



1 Introduction With the increasing growth of the internet, social media based on it such as news title, micro-blog, instant message and so on has become increasingly popular, which produces a large number of information every day, and the key here is that most of the information is appeared in the form of short text. Different from traditional text, short text is usually of the following characteristics [1]: (1) Growing dynamically. That is to say, they have a rapid growth rate. (2) Text is short. Usually there are fewer words in short texts, so they are lack of sufficient statistical information for effective similarity measure. (3) The number of short texts is huge. The existing text clustering methods cannot meet the need of theoretical analysis for the emerging wide range of short texts. Traditional document clustering methods cannot accomplish the task effectively, we need to research and design more suitable methods. Therefore how to improve the efficiency of clustering the mass of short text through information fusion has become the researching focus [2]. In recent years, clustering algorithms for short texts are constantly raised. Tang et al. [3] extended short text representation via a multi-language representation method by means of machine translation. Wang et al. [4] proposed an extension of VSM (vector space model) for communication short texts, called WR-KMeans. Yin and Wang [5] calculated the semantic distance between two word strings as balance of the © Springer International Publishing Switzerland 2016 D.-S. Huang et al. (Eds.): ICIC 2016, Part III, LNAI 9773, pp. 338–348, 2016. DOI: 10.1007/978-3-319-42297-8_32

A Novel Graph Partitioning Criterion

339

extent of word sequence alignment and the meaning matching between word strings. Peng et al. [6] explored the text characteristic of high dimensions and sparse space, proposed a novel text clustering algorithm based on semantic inner space model. Methods mentioned above are mainly focused on how to carry out feature representation and similarity calculation of short texts. However, the clustering perfo