Keyword extraction using supervised cumulative TextRank


Monali Bordoloi¹, Preetam Chayan Chatterjee¹, Saroj Kumar Biswas¹, Biswajit Purkayastha¹

Received: 21 October 2019 / Revised: 26 June 2020 / Accepted: 13 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Keyword extraction is a major step in extracting valuable and meaningful information from the rich source that is the World Wide Web (WWW). Many keyword extraction algorithms have been proposed, each with its own advantages and disadvantages. Vector Space Model (VSM) algorithms prove quite effective for keyword extraction, but do not emphasize the class label information of classified data. Supervised Term Weighting (STW) algorithms address this problem, but suffer from high dimensionality. Besides, they do not incorporate the semantic relationship between terms. To address these problems, Graph Based Models (GBM) were introduced. However, they also use unsupervised learning. Hence, this paper proposes a Keyword Extraction using Supervised Cumulative TextRank (KESCT) technique that explores the benefits of both VSM and GBM techniques. The proposed algorithm modifies TextRank by incorporating a novel Unique Statistical Supervised Weight (USSW) to include the class label information of classified data. To emphasize the relatedness between terms, the mutual information between terms is also included. The proposed algorithm is validated using four review datasets, and the results are compared with traditional TextRank and its variants using a Support Vector Machine (SVM) classifier, a Naïve-Bayes (NB) classifier and an ensemble classifier. Experimental results demonstrate the efficacy of the proposed algorithm over existing algorithms.

Keywords: Keyword extraction · Supervised learning · Supervised term weighting · Vector space model · Graph based model · Machine learning classifiers

Multimedia Tools and Applications

* Monali Bordoloi

1 Introduction

Keyword extraction is a technique to extract important features from textual data by identifying specific terms, phrases or words from a document so that the document can be represented in a concise manner [2]. This compact document representation is helpful in a number of applications, such as automatic information retrieval, plagiarism detection, document summarization, sentiment analysis, search engine optimization, and automatic text clustering and indexing. The emergence of social networking sites such as Facebook, Twitter and various other micro-blogging sites has flooded the internet with an enormous amount of text data. Effective organization, retrieval and mining of this data are the prime focus in the current digital world. These tasks cannot be done manually, since manual processing is highly time consuming and costly, and does not always guarantee accuracy. Let us consider a real-world example to understand the benefits of keyword extraction techniques. Google Play provides