Combining Statistical Information and Semantic Similarity for Short Text Feature Extension



College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu, China

Abstract. A short text feature extension method combining statistical information and semantic similarity is proposed. First, after the contribution of a word and mutual information are defined, a set of associated word pairs is generated by comparing mutual information values with a threshold; this set is then taken as the query word set for searching HowNet. For each word pair, the senses are looked up in the HowNet knowledge base and the semantic similarity of the pair is calculated. If a common sememe satisfies the condition, it is added to the original term vector as an extended feature; otherwise, the semantic relationship is computed and the corresponding sememe is added to the feature set. The process is repeated, and an extended feature set is finally obtained. Experimental results show the effectiveness of the method.

Keywords: Short text · Statistical correlation · Semantic similarity · HowNet · Feature extension

1 Introduction

With the explosion of new network media and online communication, short texts in diverse forms, such as news titles, micro-blogs, and instant messages, have become the mainstream of information exchange. Most traditional classification methods do not handle short texts well and fail to classify them effectively. Therefore, how to improve the efficiency of classifying massive amounts of short text has become a research focus. Recently, new classification methods for short text have appeared. Kim [1] proposed a novel language independent semantic (LIS) kernel, which can effectively compute the similarity between short text documents. Wang [2] presented a new method to tackle the data sparseness problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. These methods mainly focus on the concepts and correlations of texts in order to obtain their logical structure, and their classification performance has improved only to a limited extent. Yuan [3] presented a short text feature extension method based on frequent term sets; however, the large search space of the algorithm results in high time complexity, and, in particular, when the scale of the background knowledge increases, the dimension of the feature word set increases dramatically. To overcome these drawbacks, a short text feature extension method combining statistical information and semantic similarity is proposed. The flowchart is shown in Fig. 1.

© IFIP International Federation for Information Processing 2016. Published by Springer International Publishing AG 2016. All Rights Reserved. Z. Shi et al. (Eds.): IIP 2016, IFIP AICT 486, pp. 205–210, 2016. DOI: 10.1007/978-3-319-48390-0_21

[Fig. 1: flowchart. Box labels as extracted: Statistical information; Short text corpus; Feature selection; Initial feature set EF; Construct relational word-pair set; Computing mutual information; Extension feature set; Query sense and sememe of word-pairs from HowNet; Construct senses set]
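The statistical stage of the method, computing mutual information and keeping the word pairs whose value exceeds a threshold, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard pointwise mutual information estimated from document frequencies, whereas the paper's own definitions of word contribution and mutual information may differ. The function name `associated_word_pairs` and the parameters `threshold` and `min_count` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def associated_word_pairs(docs, threshold, min_count=2):
    """Return {(w1, w2): pmi} for word pairs whose PMI exceeds `threshold`.

    docs      -- list of token lists, one per short text
    threshold -- minimum PMI for a pair to be kept (assumed parameter)
    min_count -- minimum number of texts in which the pair must co-occur
                 (assumed parameter, filters out rare chance pairs)

    PMI is the standard log2(p(w1, w2) / (p(w1) * p(w2))), with the
    probabilities estimated from document frequencies. This is an
    assumption; the paper's definition is not reproduced here.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of texts containing each word
    pair_df = Counter()  # number of texts containing both words of a pair
    for tokens in docs:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    pairs = {}
    for (w1, w2), df12 in pair_df.items():
        if df12 < min_count:
            continue
        pmi = math.log2(df12 * n_docs / (word_df[w1] * word_df[w2]))
        if pmi > threshold:
            pairs[(w1, w2)] = pmi
    return pairs


# Toy corpus: "news" occurs in every text, so it carries no association signal.
docs = [
    ["news", "stock", "market", "falls"],
    ["news", "stock", "market", "rally"],
    ["news", "football", "match", "tonight"],
    ["news", "football", "match", "result"],
]
pairs = associated_word_pairs(docs, threshold=0.5)
```

On this toy corpus only the pairs (market, stock) and (football, match) survive, since any pair involving the ubiquitous word "news" has a PMI of zero. In the method described above, the resulting pairs would then serve as the query word set for HowNet.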