Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification
ORIGINAL ARTICLE
Muhammad Nabeel Asim (1,2,4) • Muhammad Usman Ghani (3,4) • Muhammad Ali Ibrahim (2,4) • Waqar Mahmood (4) • Andreas Dengel (1,2) • Sheraz Ahmed (1)
Received: 24 February 2020 / Accepted: 2 September 2020 / © The Author(s) 2020
Abstract
This paper makes five contributions toward establishing benchmark performance for Urdu text document classification. First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the performance impact of traditional machine learning-based Urdu text document classification methodologies by embedding 10 filter-based feature selection algorithms that have been widely used for other languages. Third, for the first time, it assesses the performance of various deep learning-based methodologies for Urdu text document classification; for these experiments, we adapt 10 deep learning classification methodologies that have produced the best performance figures for English text classification. Fourth, it investigates the performance impact of transfer learning by applying the Bidirectional Encoder Representations from Transformers (BERT) approach to the Urdu language. Fifth, it evaluates the integrity of a hybrid approach that combines traditional machine learning-based feature engineering with deep learning-based automated feature engineering. Experimental results show that the feature selection approach named normalized difference measure, combined with a support vector machine, outperforms the state of the art on two closed-source benchmark datasets, CLE Urdu Digest 1000k and CLE Urdu Digest 1Million, by significant margins of 32% and 13%, respectively. Across all three datasets, normalized difference measure outperforms the other filter-based feature selection algorithms, as it significantly improves the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and the presented dataset are available in the GitHub repository https://github.com/minixain/Urdu-TextClassification.

Keywords
Urdu text document classification • Urdu news classification • Urdu news genre categorization • Multi-class Urdu text categorization • Computational methodologies • Deep neural networks • BERT
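To make the filter-based feature selection step more concrete, the following is a minimal sketch of normalized difference measure (NDM) scoring, not the authors' implementation. It assumes the commonly cited form of NDM, |tpr - fpr| / min(tpr, fpr), with a small constant substituted when the minimum is zero; this formula is not stated in the abstract itself. The one-vs-rest treatment of classes, the choice to keep the maximum score over classes, the function names ndm_scores and select_top_k, and the eps value are all assumptions made for illustration.

```python
from collections import Counter


def ndm_scores(docs, labels, eps=1e-6):
    """Score every term with an assumed NDM form: |tpr - fpr| / min(tpr, fpr).

    docs   : list of token lists (one list per document)
    labels : list of class labels, same length as docs
    eps    : assumed stand-in for the denominator when min(tpr, fpr) is zero
    Each class is treated one-vs-rest and the maximum score over classes
    is kept per term (an assumption, not taken from the paper).
    """
    classes = sorted(set(labels))
    docs_per_class = Counter(labels)
    total = len(docs)

    # Document frequency of each term inside each class.
    df = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        df[label].update(set(tokens))

    vocab = set().union(*(df[c].keys() for c in classes))
    scores = {}
    for term in vocab:
        best = 0.0
        for c in classes:
            pos = docs_per_class[c]          # documents in class c
            neg = total - pos                # documents in all other classes
            tp = df[c][term]                 # class-c documents containing term
            fp = sum(df[o][term] for o in classes if o != c)
            tpr = tp / pos
            fpr = fp / neg if neg else 0.0
            denom = min(tpr, fpr) or eps     # avoid division by zero
            best = max(best, abs(tpr - fpr) / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

Under these assumptions, the terms returned by select_top_k would define the reduced vocabulary from which document vectors are built before training a classifier such as a support vector machine. Terms that occur in only one class receive very large scores because of the eps denominator, while terms spread evenly across classes score close to zero.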
Muhammad Nabeel Asim [email protected]
Muhammad Usman Ghani [email protected]
Muhammad Ali Ibrahim [email protected]
Waqar Mahmood [email protected]
Andreas Dengel [email protected]
Sheraz Ahmed Sheraz.Ahmed@dfki.

1 German Research Center for Artificial Intelligence (DFKI) GmbH, 67663 Kaiserslautern, Germany
2 Technische Universität Kaiserslautern, 67663 Kaiserslautern, Germany
3 Department of Computer Science, University of Engineering and Technology (UET), Lahore, Pakistan
4 Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan