An extractive text summarization approach using tagged-LDA based topic modeling



Ruby Rani 1 & D. K. Lobiyal 1

Received: 18 February 2020 / Revised: 26 June 2020 / Accepted: 6 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Automatic text summarization is the task of producing an abridged form of a text document that covers its salient content. Numerous statistical, linguistic, rule-based, and position-based text summarization approaches have been explored for different rich-resourced languages. For under-resourced languages such as Hindi, automatic text summarization is a challenging and still unsolved problem. Another issue with such languages is the unavailability of corpora and the inadequacy of processing tools. In this paper, we propose an extractive, lexical knowledge-rich, topic modeling based text summarization approach for Hindi novels and stories, in which we implement four independent variants using different sentence weighting schemes. Because no corpus was available, we prepared a corpus of Hindi novels and stories. We use a smoothing technique to obtain informative and diverse summaries, and then evaluate the generated summaries against three metrics (gist diversity, retention ratio, and ROUGE score). The results show that the proposed model produces concise, articulate, and coherent summaries. To investigate the performance of the proposed model further, we also run the experiments on an English dataset. Finally, we compare our models with the baselines and a traditional topic modeling approach, and show that the proposed model achieves the best results among them.

Keywords Topic modeling · Hindi novel · Topic diversity · Retention ratio · Tagged-LDA

* Ruby Rani [email protected]
D. K. Lobiyal [email protected]

1 School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi, India

Multimedia Tools and Applications

1 Introduction

Due to breakthroughs in technologies such as big data, cloud computing, wireless communication, sensors, and the Internet of Things, a huge amount of digital data has accumulated on the internet. From this enormous volume of data, users need only the useful information, and they need it quickly. Thus, it has become a challenge to extract essential information from a large corpus effectively and convincingly. One method is to condense the data without losing its essential content. Conventional methods rely on manual effort to condense a document, which demands significant time. An automatic summary generation system can therefore save both time and human effort. Over the last decade, Automatic Text Summarization (ATS) systems have been introduced to generate a concise and accurate form of large digital textual information while covering all required information [14]. The main objective of ATS is to collect the relevant and traceable points of a large document into a small space. ATS now has numerous effective real-time applications, such as opinion mining [19], review