Comparative evaluation of text classification techniques using a large diverse Arabic dataset


Mohammad S. Khorsheed · Abdulmohsen O. Al-Thubaity

Published online: 10 March 2013 © Springer Science+Business Media Dordrecht 2013

Abstract A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While a lot of effort has been put into Western languages, mostly English, minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selections, weighting methods, and classification algorithms show, on average, the superiority of support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the least accurate was 61% for the Arabic Poems dataset.

Keywords Machine learning · Arabic text categorization · Arabic text classification

M. S. Khorsheed (✉) · A. O. Al-Thubaity
King Abdulaziz City for Science & Technology, P O Box 6086, Riyadh 11442, Saudi Arabia

1 Introduction

Documents are the primary repositories of knowledge; therefore, documentation is the most effective way to illustrate ideas, thoughts, and expertise. Making documents available in a machine-readable format and handling them in an intelligent way, such as through text classification, will maximize the benefit of the knowledge they contain. Arabic machine-readable texts are available both on the Internet and within government organizations and private enterprises, and their number is increasing rapidly. However, whereas automatic text classification is well known in natural language processing communities, little attention has been given to Arabic texts.

Text classification, the assignment of free text documents to one or more predefined categories based on their content, is used in various applications, such as e-mail filtering, spam detection, web-page content filtering, automatic message routing, automated indexing of articles, and searching for relevant information on the Web.

There are three main phases involved in building a classification system: (a) compilation of the training dataset, (b) selection of the set of features to represent the defined classes, and (c) training the chosen classification algorithm, followed by testing it using the corpus compiled in the first stage. Automated document classification involves taking a set of pre-classified documents as the training set. The training data is then an
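The three phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy English documents, the class labels, the document-frequency feature selection, and all function names are assumptions made for illustration. Naïve Bayes is used only because it is one of the classifiers the paper evaluates and is short enough to write with the Python standard library.

```python
import math
from collections import Counter, defaultdict

# Phase (a): a tiny, made-up pre-classified training set (illustrative only,
# not the dataset described in the paper).
TRAIN = [
    ("the match ended with a late goal", "sports"),
    ("the team won the league title", "sports"),
    ("the bank raised interest rates", "economy"),
    ("stock market prices fell sharply", "economy"),
]

def tokenize(text):
    """Split on whitespace; a real Arabic pipeline would need more care."""
    return text.lower().split()

def select_features(docs, top_k=50):
    """Phase (b): keep the top_k terms by document frequency.
    This is an assumed, simple stand-in for a feature selection method."""
    doc_freq = Counter()
    for text, _ in docs:
        doc_freq.update(set(tokenize(text)))
    return {term for term, _ in doc_freq.most_common(top_k)}

def train_naive_bayes(docs, vocab):
    """Phase (c), training: multinomial Naive Bayes with add-one smoothing."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for text, label in docs:
        term_counts[label].update(t for t in tokenize(text) if t in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def classify(model, vocab, text):
    """Phase (c), testing: pick the class with the highest log-probability."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        known = [t for t in tokenize(text) if t in vocab]
        scores[label] = log_prior + sum(log_like[t] for t in known)
    return max(scores, key=scores.get)

vocab = select_features(TRAIN)
model = train_naive_bayes(TRAIN, vocab)
prediction = classify(model, vocab, "interest rates and stock prices")
print(prediction)
```

In this sketch, terms that were not selected as features (such as "and") are simply ignored at classification time, and add-one smoothing keeps a term unseen in one class from driving that class's probability to zero.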
