Combining Statistical Information and Semantic Similarity for Short Text Feature Extension


College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu, China [email protected]

Abstract. A short text feature extension method combining statistical information and semantic similarity is proposed. First, after word contribution and mutual information are defined, a set of associated word pairs is generated by comparing mutual information values with a threshold; this set is then used as the query word set for searching HowNet. For each word pair, the senses are looked up in the HowNet knowledge base and the semantic similarity of the query word pair is calculated. A common sememe that satisfies the condition is added to the original term vector as an extended feature; otherwise, the semantic relationship is computed and the corresponding sememe is added to the feature set. The above process is repeated, and an extended feature set is finally obtained. Experimental results show the effectiveness of the method.

Keywords: Short text · Statistical correlation · Semantic similarity · HowNet · Feature extension

1 Introduction

With the explosion of online new media and online communication, short texts in diverse forms, such as news titles, micro-blogs, and instant messages, have become the mainstream of information exchange. Most traditional classification methods are not well suited to short texts and fail to classify them effectively. Improving the efficiency of classifying massive volumes of short text has therefore become a research focus. Recently, new methods for short text classification have appeared. Kim [1] proposed a novel language independent semantic (LIS) kernel, which can effectively compute the similarity between short text documents. Wang [2] presented a new method to tackle the data sparseness problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. The methods mentioned above mainly focus on the concepts and correlations of texts to obtain their logical structure, and their classification performance has improved a little as a result. Yuan [3] presented a short text feature extension method based on frequent term sets; however, the larger search space of the algorithm results in higher time complexity, and in particular, when the scale

© IFIP International Federation for Information Processing 2016 Published by Springer International Publishing AG 2016. All Rights Reserved Z. Shi et al. (Eds.): IIP 2016, IFIP AICT 486, pp. 205–210, 2016. DOI: 10.1007/978-3-319-48390-0_21


of the background knowledge increases, the dimension of the feature word set increases dramatically. To overcome these drawbacks, a short text feature extension method combining statistical information and semantic similarity is proposed. The flowchart is shown in Fig. 1.

[Fig. 1: flowchart of the method. Recoverable node labels: Statistic information; Short text corpus; Feature selection; Initial feature set; EF; Construct relational word-pair set; Computing mutual information; Extension feature set; Query sense and sememe of word-pairs from HowNet; Construct senses set.]
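The statistical step, generating associated word pairs by comparing mutual information with a threshold, can be illustrated with a minimal sketch. The paper's own definitions of word contribution and mutual information are not reproduced in this excerpt, so the sketch substitutes standard pointwise mutual information (PMI) estimated from document-level co-occurrence counts. The function name `associated_word_pairs`, the toy corpus, and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def associated_word_pairs(docs, threshold):
    """Return {(w1, w2): pmi} for word pairs whose PMI exceeds threshold.

    Assumption of this sketch (not taken from the paper): probabilities
    are estimated from document-level occurrence counts, and standard
    pointwise mutual information stands in for the paper's measure.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))  # sort so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    pairs = {}
    for (w1, w2), n12 in pair_df.items():
        # PMI = log( p(w1, w2) / (p(w1) * p(w2)) ), with p estimated as df / n_docs
        pmi = math.log((n12 * n_docs) / (word_df[w1] * word_df[w2]))
        if pmi > threshold:
            pairs[(w1, w2)] = pmi
    return pairs


# Toy corpus (hypothetical): "news" occurs everywhere, so it carries no association.
docs = [
    ["news", "stock", "market"],
    ["news", "stock", "market"],
    ["news", "football", "match"],
    ["news", "football", "match"],
]
pairs = associated_word_pairs(docs, threshold=0.5)
```

In this toy corpus only the pairs ("market", "stock") and ("football", "match") pass the threshold, because pairs involving the ubiquitous word "news" have a PMI of zero. The surviving pairs would then play the role of the query word set passed to the HowNet lookup.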
