Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification


ORIGINAL ARTICLE

Muhammad Nabeel Asim1,2,4 • Muhammad Usman Ghani3,4 • Muhammad Ali Ibrahim2,4 • Waqar Mahmood4 • Andreas Dengel1,2 • Sheraz Ahmed1



Received: 24 February 2020 / Accepted: 2 September 2020 / © The Author(s) 2020

Abstract This paper makes several contributions toward establishing benchmark performance for Urdu text document classification. First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the performance impact of traditional machine learning-based Urdu text document classification methodologies by embedding 10 filter-based feature selection algorithms that have been widely used for other languages. Third, for the first time, it assesses the performance of various deep learning-based methodologies for Urdu text document classification; for these experiments, we adapt 10 deep learning classification methodologies that have produced the best performance figures for English text classification. Fourth, it investigates the performance impact of transfer learning by applying the Bidirectional Encoder Representations from Transformers (BERT) approach to the Urdu language. Fifth, it evaluates the integrity of a hybrid approach that combines traditional machine learning-based feature engineering with deep learning-based automated feature engineering. Experimental results show that the normalized difference measure feature selection approach, combined with a support vector machine, outperforms the state of the art on two closed-source benchmark datasets, CLE Urdu Digest 1000k and CLE Urdu Digest 1Million, by significant margins of 32% and 13%, respectively. Across all three datasets, normalized difference measure outperforms the other filter-based feature selection algorithms, significantly improving the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and presented dataset are available in the GitHub repository https://github.com/minixain/Urdu-TextClassification.
Keywords Urdu text document classification • Urdu news classification • Urdu news genre categorization • Multi-class Urdu text categorization computational methodologies • Deep neural networks • BERT
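
To make the filter-based feature selection step mentioned in the abstract concrete, the following is a minimal sketch of scoring terms with a normalized difference measure. It rests on assumptions that do not come from this section: the score is taken as |tpr - fpr| / min(tpr, fpr), following the common definition of the measure in the feature selection literature, with a small hypothetical `eps` substituted when the minimum is zero. The function names, the toy English tokens, and the labels are all illustrative and are not taken from the authors' repository.

```python
from collections import Counter


def ndm_scores(docs, labels, positive, eps=1e-6):
    """Score terms for one class vs. the rest.

    Assumed formula (not stated in this section):
        NDM(t) = |tpr - fpr| / min(tpr, fpr)
    where tpr and fpr are the fractions of positive and negative
    documents containing term t. `eps` is a hypothetical stand-in
    for a zero denominator. Both classes must be non-empty.
    """
    pos = [set(d) for d, y in zip(docs, labels) if y == positive]
    neg = [set(d) for d, y in zip(docs, labels) if y != positive]
    pos_df = Counter(t for d in pos for t in d)  # document frequency, positive class
    neg_df = Counter(t for d in neg for t in d)  # document frequency, other classes
    scores = {}
    for term in set(pos_df) | set(neg_df):
        tpr = pos_df[term] / len(pos)
        fpr = neg_df[term] / len(neg)
        denom = min(tpr, fpr) or eps
        scores[term] = abs(tpr - fpr) / denom
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up tokenized documents (illustrative only).
docs = [
    ["match", "team", "score"],
    ["team", "win", "match"],
    ["market", "stock", "price"],
    ["stock", "team", "price"],
]
labels = ["sports", "sports", "business", "business"]

scores = ndm_scores(docs, labels, positive="sports")
top_terms = select_top_k(scores, 2)
```

In a full pipeline of the kind the abstract describes, the selected terms would define the vocabulary of the document vectors passed to a classifier such as a support vector machine. That step is omitted here to keep the sketch dependency-free.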

Muhammad Nabeel Asim (corresponding author) [email protected]
Muhammad Usman Ghani [email protected]
Muhammad Ali Ibrahim [email protected]
Waqar Mahmood [email protected]
Andreas Dengel [email protected]
Sheraz Ahmed Sheraz.Ahmed@dfki.

1 German Research Center for Artificial Intelligence (DFKI) GmbH, 67663 Kaiserslautern, Germany
2 Technische Universität Kaiserslautern, 67663 Kaiserslautern, Germany
3 Department of Computer Science, University of Engineering and Technology (UET), Lahore, Pakistan
4 Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan