Topic modeling combined with classification technique for extractive multi-document text summarization



METHODOLOGIES AND APPLICATION

Rajendra Kumar Roul1

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract The quality of the human-readable summaries available in existing datasets is not up to the mark, which makes it difficult to build an accurate model for text summarization. Although recent works have largely addressed this issue and established a strong platform for further improvements, they still have many limitations. In this direction, the paper proposes a novel methodology for summarizing a corpus of documents to generate a coherent summary using topic modeling and a classification technique. The objectives of the proposed work are highlighted below:

• A novel heuristic approach is introduced to find the actual number of topics that exist in a corpus of documents, which handles the stochastic nature of latent Dirichlet allocation.
• A large corpus of documents is handled by reducing the huge set of sentences to a small set without losing the important ones, thus providing a concise and information-rich summary.
• The sentences are arranged according to their importance in the coherent summary.
• Results of the experiment are compared with state-of-the-art summarization systems.

The outcomes of the empirical work show that the proposed model is more promising than the well-known text summarization models.

Keywords Classification · Extractive · LDA · ROUGE · Silhouette · Summarization · Topic modeling
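The first objective — selecting the number of LDA topics despite the model's stochastic nature — can be illustrated with a minimal sketch. This is not the paper's exact heuristic (which is detailed later); it simply fits LDA for several candidate topic counts on a toy corpus and keeps the count whose hard topic assignments give the best silhouette score, one of the paper's keywords. All names, the corpus, and the candidate range here are illustrative assumptions.

```python
# Sketch only: pick an LDA topic count via silhouette score.
# Corpus, candidate counts, and function names are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import silhouette_score


def best_topic_count(docs, candidates=(2, 3, 4), seed=0):
    """Fit LDA for each candidate topic count and keep the count whose
    document-topic assignments yield the highest silhouette score."""
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    best_k, best_s = None, -1.0
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        theta = lda.fit_transform(X)      # document-topic distributions
        labels = theta.argmax(axis=1)     # hard topic label per document
        if len(set(labels)) < 2:          # silhouette needs >= 2 clusters
            continue
        s = silhouette_score(theta, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k, best_s


docs = [
    "the goalkeeper saved the penalty in the football match",
    "the striker scored a goal in the final football match",
    "football fans cheered the team after the match",
    "simmer the onions and garlic in olive oil before serving",
    "season the soup with salt pepper and fresh garlic",
    "bake the bread until golden and serve with olive oil",
]
k, score = best_topic_count(docs)
```

Fixing `random_state` pins down one LDA run; the paper's heuristic instead has to cope with the variation across runs, which this sketch does not model.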

1 Introduction The tremendous growth of the internet and portable computing systems has resulted in a boom in the amount of data generated. Analyzing and interpreting such large amounts of data using various technologies and generating a brief summary is one of the most active research areas in computer science. Recognizing the main points of a text and expressing them in a shorter document is what we call text summarization (Miller 1995). The first technique for automatic text summarization was formulated and published around 60 years ago (Luhn 1958). Early methods mostly focused on term frequency as a criterion for ranking the importance of sentences. However, the computing systems of the time were not powerful enough to carry out such complex tasks on large volumes of data, and hence progress in this field was limited. In spite of such constraints, researchers continued to hypothesize newer and more efficient methods to improve the accuracy of the generated summaries. Some of the ideas introduced relate to the importance of certain keywords in documents, the importance of the position of sentences in documents, the similarity among sentences of a document, etc. Based on the summary generated from a corpus, the summarization process is classified into two categories (Valizadeh and Brazdil 2015):
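The early term-frequency criterion mentioned above can be sketched in a few lines. This is a simplified, illustrative scorer in the spirit of Luhn (1958), not the method of this paper: each sentence is scored by the average corpus frequency of its content words, and the highest-scoring sentences form the extract. The stopword list and helper names are assumptions for the example.

```python
# Illustrative Luhn-style sentence ranking by term frequency.
# Stopword list and function names are assumptions, not from the paper.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "that"}


def rank_sentences(text, top_n=2):
    """Return the top_n sentences scored by the mean corpus
    frequency of their non-stopword terms."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)  # corpus-level term frequencies

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z]+", sentence.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / max(len(toks), 1)

    return sorted(sentences, key=score, reverse=True)[:top_n]


text = ("Summarization selects important sentences. "
        "Summarization relies on frequency counts. "
        "Cats sleep all day.")
top = rank_sentences(text, top_n=2)
```

On this toy text the two sentences sharing the frequent word "summarization" outrank the off-topic one, which is exactly the behavior the frequency criterion was designed to capture.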

Communicated by V. Loia.

1 Rajendra Kumar Roul
[email protected]
Department of Computer Science and Engineering, Thapar Institute of Engineering and