Knowledge-driven graph similarity for text classification


ORIGINAL ARTICLE

Niloofer Shanavas1 · Hui Wang1 · Zhiwei Lin1 · Glenn Hawe1

Received: 16 November 2019 / Accepted: 3 October 2020
© The Author(s) 2020

Abstract

Automatic text classification using machine learning is significantly affected by the text representation model. Structural information in text is necessary for natural language understanding, but it is usually ignored in vector-based representations. In this paper, we present a graph kernel-based text classification framework which utilises the structural information in text effectively through the weighting and enrichment of a graph-based representation. We introduce weighted co-occurrence graphs to represent text documents, which weight the terms and their dependencies based on their relevance to text classification. We propose a novel method to automatically enrich the weighted graphs using semantic knowledge in the form of a word similarity matrix. The similarity between enriched graphs, which we call knowledge-driven graph similarity, is calculated using a graph kernel. The semantic knowledge in the enriched graphs ensures that the graph kernel goes beyond exact matching of terms and patterns to compute the semantic similarity of documents. In experiments on sentiment classification and topic classification tasks, our knowledge-driven similarity measure significantly outperforms the baseline text similarity measures on five benchmark text classification datasets.

Keywords Automatic text classification · Document similarity measure · Graph-based text representation · Graph enrichment · Graph kernels · Supervised term weighting · SVM
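To make the idea of going beyond exact matching concrete, the following is a minimal Python sketch and not the method of this paper. It builds unweighted co-occurrence graphs with a sliding window and compares their edges, first with exact term matching and then with a toy word similarity table. The function names (`cooccurrence_graph`, `term_sim`, `edge_similarity`, `normalised_similarity`), the window size, the similarity values, and the edge-comparison rule are all assumptions made for illustration; the supervised term weighting, the enrichment procedure, and the graph kernel actually used are not reproduced here.

```python
from collections import Counter
from math import sqrt


def cooccurrence_graph(tokens, window=1):
    """Undirected co-occurrence graph as {(u, v): count}, with u < v.

    Two terms are linked when they appear within `window` positions of
    each other. The counts are raw frequencies (an assumption; no
    supervised weighting is applied).
    """
    edges = Counter()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != term:
                edges[tuple(sorted((term, other)))] += 1
    return edges


def term_sim(a, b, sim):
    """1.0 for identical terms, otherwise a lookup in the toy table `sim`."""
    if a == b:
        return 1.0
    return sim.get((a, b), sim.get((b, a), 0.0))


def edge_similarity(g1, g2, sim):
    """Sum of edge-weight products, scaled by the similarity of endpoints.

    Illustrative stand-in for a graph kernel: with an empty `sim` table
    it only counts exactly matching edges.
    """
    total = 0.0
    for (u1, v1), w1 in g1.items():
        for (u2, v2), w2 in g2.items():
            straight = term_sim(u1, u2, sim) * term_sim(v1, v2, sim)
            crossed = term_sim(u1, v2, sim) * term_sim(v1, u2, sim)
            total += w1 * w2 * max(straight, crossed)
    return total


def normalised_similarity(g1, g2, sim):
    """Scale edge_similarity so that a graph compared with itself gives 1."""
    denom = sqrt(edge_similarity(g1, g1, sim) * edge_similarity(g2, g2, sim))
    return edge_similarity(g1, g2, sim) / denom if denom else 0.0


# Toy documents and a hypothetical word similarity table.
doc_a = cooccurrence_graph("the film was great".split())
doc_b = cooccurrence_graph("the movie was excellent".split())
toy_sim = {("film", "movie"): 0.9, ("great", "excellent"): 0.8}

exact = normalised_similarity(doc_a, doc_b, {})      # exact edge matching only
soft = normalised_similarity(doc_a, doc_b, toy_sim)  # with semantic knowledge
print(exact, soft)
```

In this toy case the two documents share no edge, so exact matching gives a similarity of zero, while the similarity table lets the edges around "film"/"movie" and "great"/"excellent" contribute. The maximum over the two endpoint alignments is a convenience for undirected edges and is not guaranteed to yield a positive semi-definite kernel.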

* Niloofer Shanavas shanavas‑[email protected]

1 School of Computing, Ulster University, Jordanstown BT37 0QB, UK

1 Introduction

Research on automatic text classification has gained importance due to the information overload problem and the need for faster and more accurate extraction of knowledge from huge data sources. Text classification assigns predefined labels to documents based on their content. An important step in automatic text classification is the effective representation of text. Bag-of-words is the most commonly used text representation scheme. It is based on the term independence assumption: a text document is regarded as a set of unordered terms and is represented as a vector. It is simple and fast, but it ignores the structural information in text, such as syntactic and semantic information. In contrast, the graph-based representation scheme is much more expressive than the bag-of-words representation and can represent structural information such as term dependencies. It has been shown that graph-based representation can outperform bag-of-words representation [12, 18, 26–28, 32, 37].

Document similarity is used in many text processing tasks such as text classification, clustering and information retrieval. Document similarity is usually measured as the distance/similarity between the vector representations of text documents under the assumption t