Text categorization: past and present


Ankita Dhar¹ · Himadri Mukherjee¹ · Niladri Sekhar Dash² · Kaushik Roy¹

© Springer Nature B.V. 2020

Abstract
Automatic text categorization is the task of sorting text documents into predefined categories using machine learning algorithms. It is one of the most important approaches to organizing and making use of the large volume of information that exists in unstructured form. Text categorization has become an extensively researched field within text mining and natural language processing. Word sense and the semantic relationships among terms, text documents and categories are essential for enhancing categorization performance. Several surveys on text categorization are already available; they cover various text representation schemes to some extent but omit several approaches that have been explored beyond the standard techniques. Here, an exhaustive analysis of text categorization approaches beyond the conventional ones is undertaken. This survey explores a wide variety of algorithms used for categorizing text documents and groups the existing works into three basic fields: conventional methods, fuzzy logic-based methods and deep learning-based methods. Conventional methods are further divided into three fields: text categorization using handcrafted features, using nature-inspired algorithms and using graph-based methods. The survey also gives a clear picture of the libraries available for different algorithms, the availability of datasets, and the categorization technologies explored in various Indian and non-Indian languages.

Keywords  Text categorization · Conventional methods · Fuzzy logic · Deep learning · Nature-inspired algorithms · Graph-based methods

* Corresponding author: Kaushik Roy

1 Department of Computer Science, West Bengal State University, Kolkata, India

2 Linguistic Research Unit, Indian Statistical Institute, Kolkata, India




1 Introduction
Research on text mining has been gaining attention steadily in recent years because of the many sources, such as social media, blogs, e-mails, on-line libraries and others, that produce huge amounts of digital data. The growth of digital text data shows the need for techniques for processing and classifying texts. Manual ordering, organizing and sorting of digital text data is a rudimentary issue. The applications of text mining include text categorization, text document filtering, text summarization, question answering, and sentiment, emotion or opinion analysis. Natural language processing, data mining and machine learning approaches work jointly to recognize t