Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification


ORIGINAL ARTICLE

Muhammad Nabeel Asim (1,2,4) • Muhammad Usman Ghani (3,4) • Muhammad Ali Ibrahim (2,4) • Waqar Mahmood (4) • Andreas Dengel (1,2) • Sheraz Ahmed (1)

Received: 24 February 2020 / Accepted: 2 September 2020 / © The Author(s) 2020

Abstract To provide benchmark performance for Urdu text document classification, this paper makes five contributions. First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the performance impact of traditional machine learning-based Urdu text document classification methodologies by embedding 10 filter-based feature selection algorithms that have been widely used for other languages. Third, for the first time, it assesses the performance of various deep learning-based methodologies for Urdu text document classification. For this experimentation, we adapt 10 deep learning classification methodologies that have produced the best performance figures for English text classification. Fourth, it investigates the performance impact of transfer learning by applying the Bidirectional Encoder Representations from Transformers (BERT) approach to the Urdu language. Fifth, it evaluates the integrity of a hybrid approach that combines traditional machine learning-based feature engineering with deep learning-based automated feature engineering. Experimental results show that the normalized difference measure feature selection approach, combined with a support vector machine, outperforms the state of the art on two closed-source benchmark datasets, CLE Urdu Digest 1000k and CLE Urdu Digest 1Million, by significant margins of 32% and 13%, respectively. Across all three datasets, normalized difference measure outperforms the other filter-based feature selection algorithms, as it significantly improves the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and the presented dataset are available in the GitHub repository https://github.com/minixain/Urdu-TextClassification.

Keywords Urdu text document classification · Urdu news classification · Urdu news genre categorization · Multi-class Urdu text categorization · Computational methodologies · Deep neural networks · BERT
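The abstract names normalized difference measure as a filter-based feature selection approach but does not define it. The sketch below is illustrative only: it assumes the commonly cited formulation NDM = |tpr - fpr| / min(tpr, fpr), where tpr and fpr are the fractions of positive and negative documents that contain a term, and it substitutes a small constant for a zero denominator. The function names `ndm_scores` and `select_top_k`, the `eps` value, and the toy English documents are all hypothetical and do not come from the paper or its repository.

```python
from typing import Dict, List, Set


def ndm_scores(docs: List[Set[str]], labels: List[int],
               eps: float = 1e-6) -> Dict[str, float]:
    """Score every term for a binary task (label 1 = positive class).

    Assumed formulation: |tpr - fpr| / min(tpr, fpr), with `eps`
    replacing a zero denominator. This is an assumption, not a
    definition taken from the paper.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one document in each class")

    vocab = set().union(*docs)
    scores: Dict[str, float] = {}
    for term in vocab:
        # Count positive and negative documents that contain the term.
        tp = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)
        fp = sum(1 for d, y in zip(docs, labels) if y != 1 and term in d)
        tpr = tp / n_pos
        fpr = fp / n_neg
        denom = max(min(tpr, fpr), eps)
        scores[term] = abs(tpr - fpr) / denom
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical): each document is a set of tokens.
docs = [
    {"cricket", "match"},
    {"cricket", "team"},
    {"budget", "match"},
    {"budget", "tax"},
]
labels = [1, 1, 0, 0]  # 1 = sports, 0 = business

scores = ndm_scores(docs, labels)
top_terms = select_top_k(scores, 2)
```

In a pipeline of the kind the abstract describes, only the selected terms would be kept as features before training a classifier such as a support vector machine. A multi-class dataset would need a one-vs-rest or similar aggregation of per-class scores, which this sketch does not cover.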

Corresponding author: Muhammad Nabeel Asim

1 German Research Center for Artificial Intelligence (DFKI) GmbH, 67663 Kaiserslautern, Germany

2 Technische Universität Kaiserslautern, 67663 Kaiserslautern, Germany

3 Department of Computer Science, University of Engineering and Technology (UET), Lahore, Pakistan

4 Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan

Sheraz Ahmed: Sheraz.Ahmed@dfki.
