Feature selection method using improved CHI Square on Arabic text classifiers: analysis and application
Hadeel N. Alshaer 1 & Mohammed A. Otair 1 & Laith Abualigah 1 & Mohammad Alshinwan 1 & Ahmad M. Khasawneh 1
Received: 6 June 2020 / Revised: 31 August 2020 / Accepted: 13 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Text classification is the task of assigning texts to predefined groups according to their content. Over the past few years, the volume of information available on the Internet has grown across many fields, making text classification one of the most important, yet most challenging, tasks. Text classification is used in numerous applications and for different objectives. The widespread use of the Internet, particularly in the Arab world, and the massive number of documents and pages available in Arabic have raised the need for suitable tools that classify these pages and documents by their main categories. The aim of this paper is to study the effect of the improved CHI Square (ImpCHI) on the performance of six well-known classifiers: Random Forest, Decision Tree, Naïve Bayes, Naïve Bayes Multinomial, Bayes Net, and Artificial Neural Networks. These proposed techniques are important for improving the classification of Arabic documents and can be regarded as a promising basis for the text classification stage, because they contribute to classifying texts into predefined categories. The combined method takes advantage of more than one technique, which can produce better final results. The dataset used in this paper comprises 9055 Arabic documents collected from various Arabic resources. Based on their content, these documents were divided into twelve categories. Four performance evaluation criteria were used: F-measure, recall, precision, and model build time. The experimental results show that ImpCHI Square gives better classification results than the standard CHI Square method with all studied classifiers, in terms of all performance criteria used.
Keywords Text classification algorithms · Bayes net · Naïve Bayes · Random Forest · Decision tree · Artificial neural networks · CHI Square
* Laith Abualigah
Extended author information available on the last page of the article
Multimedia Tools and Applications
1 Introduction
Information Retrieval (IR) is a field of computer science that has grown in importance as the volume of information increases. This information often needs to be organized and classified so that it can be retrieved easily. Text classification (TC) is a process that has become important in various fields, especially on the Internet. Text mining is the analysis of natural-language text and seeks to extract useful information from textual data. In addition, text mining helps organizations extract valuable insights from document content.
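As background for the feature selection step named in the abstract, the following is a minimal sketch of the standard (textbook) chi-square score between a term and a category, computed from document counts. It is not the improved variant (ImpCHI) studied in the paper, whose definition is not given in this part of the text. The function names, the token-list input format, and the toy English documents are assumptions made purely for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square score for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. term in every document): no information.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def score_terms(docs, labels, category):
    """Score every term against one category.

    docs is a list of token lists; labels holds one category per document.
    """
    n_docs = len(docs)
    n_in = sum(1 for y in labels if y == category)
    df_total = Counter()  # document frequency over the whole corpus
    df_in = Counter()     # document frequency inside the category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_total[term] += 1
            if y == category:
                df_in[term] += 1
    scores = {}
    for term, df in df_total.items():
        a = df_in[term]
        b = df - a
        c = n_in - a
        d = n_docs - n_in - b
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical toy corpus (not the paper's dataset).
docs = [
    ["goal", "match"],
    ["goal", "team"],
    ["vote", "election"],
    ["vote", "party"],
]
labels = ["sport", "sport", "politics", "politics"]
scores = score_terms(docs, labels, "sport")
for term in sorted(scores, key=lambda t: (-scores[t], t)):
    print(term, round(scores[term], 3))
```

Note that the plain statistic is symmetric: a term that is strongly tied to a different category (here "vote") scores as high for "sport" as a term that occurs only in "sport" documents (here "goal"). A feature selector would typically keep the highest-scoring terms per category and pass only those to the classifier.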