Comparative evaluation of text classification techniques using a large diverse Arabic dataset
Mohammad S. Khorsheed · Abdulmohsen O. Al-Thubaity
Published online: 10 March 2013 © Springer Science+Business Media Dordrecht 2013
Abstract
A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While a lot of effort has been put into Western languages, mostly English, minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selections, weighting methods, and classification algorithms show, on average, the superiority of support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the least accurate was 61% for the Arabic Poems dataset.

Keywords Machine learning · Arabic text categorization · Arabic text classification
M. S. Khorsheed · A. O. Al-Thubaity
King Abdulaziz City for Science & Technology, P.O. Box 6086, Riyadh 11442, Saudi Arabia

1 Introduction
Documents are the primary repositories of knowledge; therefore, documentation is the most effective way to illustrate ideas, thoughts, and expertise. The availability of documents in a machine-readable format and handling them in an intelligent way, such as through text classification, will maximize the benefit of the knowledge they contain. Arabic machine-readable texts are available both on the Internet and within government organizations and private enterprises, and they are rapidly increasing day by day. However, whereas automatic text classification is well known in natural language processing communities, little attention has been given to Arabic texts.

Text classification, the assignment of free text documents to one or more predefined categories based on their content, is used in various applications, such as e-mail filtering, spam detection, web-page content filtering, automatic message routing, automated indexing of articles, and searching for relevant information on the Web.

There are three main phases involved in building a classification system: (a) compilation of the training dataset, (b) selection of the set of features to represent the defined classes, and (c) training the chosen classification algorithm, followed by testing it using the corpus compiled in the first stage. Automated document classification involves taking a set of pre-classified documents as the training set. The training data is then an
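The three phases listed in the introduction can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the English tokens, the document-frequency feature selection, and the function names (`select_features`, `train_nb`, `classify`) are all assumptions made for illustration. Naïve Bayes is used only because it is one of the algorithms named in the abstract.

```python
import math
from collections import Counter, defaultdict

# Phase (a): a tiny hypothetical labelled corpus (not the paper's dataset).
TRAIN = [
    ("match goal team player", "sports"),
    ("team win league goal", "sports"),
    ("market stock price bank", "economy"),
    ("bank loan price market", "economy"),
]
TEST = [("player goal league", "sports"), ("stock bank loan", "economy")]


def select_features(docs, k):
    """Phase (b): keep the k terms with the highest document frequency.

    Document frequency is an assumed, simple selection criterion.
    """
    df = Counter()
    for text, _ in docs:
        df.update(set(text.split()))
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:k])


def train_nb(docs, vocab):
    """Phase (c): fit multinomial Naive Bayes with add-one smoothing."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for text, label in docs:
        term_counts[label].update(t for t in text.split() if t in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def classify(model, text):
    """Return the label with the highest log-posterior score."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # Terms outside the selected vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_like[t] for t in text.split() if t in log_like
        )
    return max(scores, key=scores.get)


vocab = select_features(TRAIN, k=6)
model = train_nb(TRAIN, vocab)

# Testing step: compare predictions with the gold labels of held-out documents.
predictions = [classify(model, text) for text, _ in TEST]
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, TEST)) / len(TEST)
print(predictions, accuracy)
```

In a realistic setting the corpus would be far larger, and Arabic text would typically need its own tokenization and normalization before feature selection; those steps are omitted here.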