CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus
Indian Academy of Sciences, Sadhana
PRAKASH CHOUDHARY1,* and NEETA NAIN2
1 Department of Computer Science and Engineering, National Institute of Technology Hamirpur, Hamirpur 177005, Himachal Pradesh, India
2 Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, Jaipur, India
e-mail: [email protected]
MS received 30 September 2015; revised 24 November 2018; accepted 7 September 2019

Abstract. In this paper, we introduce an efficient framework for compiling an Urdu corpus together with its ground truth and transcription in Unicode format. A novel annotation scheme based on four-level XML is used for the CALAM corpus. In addition to compilation and benchmark testing, the framework generates category-wise word frequency distributions that are useful for linguistic evaluation. This paper presents a statistical analysis of the corpus data based on the transcribed text and the frequency of occurrences. The analysis uses key statistics such as the rank of words, the frequency of words, ligature length (the number of ligatures formed from two to seven characters), and the entropy and perplexity of the corpus. Beyond these basic statistics, some additional features are evaluated, such as Zipf’s linguistic rule and measures of dispersion in the corpus. The experimental results of the statistical analysis are presented to demonstrate the viability and usability of the corpus data as a standard platform for linguistic research on the Urdu language.

Keywords. Corpus statistical analysis; Zipf’s rule; quantitative analysis; linguistic evaluation; corpus; NLP.
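The statistics named in the abstract (word rank, word frequency, entropy and perplexity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper's own definitions are not given in this part of the text, so the sketch assumes whitespace tokenization, unigram Shannon entropy measured in bits, and perplexity defined as 2 raised to that entropy. The function name `corpus_statistics` is hypothetical.

```python
import math
from collections import Counter


def corpus_statistics(text):
    """Return (ranked, entropy, perplexity) for a transcribed text.

    Assumptions (not taken from the paper): tokens are split on
    whitespace, entropy is the unigram Shannon entropy in bits, and
    perplexity is 2 ** entropy.
    """
    tokens = text.split()
    counts = Counter(tokens)
    total = sum(counts.values())

    # Rank 1 is the most frequent word; each entry is (rank, word, frequency).
    ranked = [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

    # H = -sum(p * log2(p)) over the relative frequency p of each word.
    entropy = -sum(
        (freq / total) * math.log2(freq / total) for freq in counts.values()
    )
    perplexity = 2 ** entropy
    return ranked, entropy, perplexity
```

Under Zipf's rule, frequency is roughly inversely proportional to rank, so the product of rank and frequency over the entries of `ranked` would be expected to stay roughly constant for a sufficiently large corpus.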
1. Introduction
Over the last five decades, corpus methodology has become one of the fastest-growing and most widely used approaches in linguistics. From a linguistic perspective, a corpus is a collection of texts compiled systematically to provide a platform for various kinds of linguistic research. The reliability of a corpus depends on how well the selected texts cover the language and on the quality of those texts. Recent advances in computer-based corpus compilation have made the process easier and have opened many new areas of research in natural language processing [1].

A variety of standard handwritten databases have been developed for scripts such as English [2], Chinese (the HITMW database [3]), Japanese (ETL9 [4]), Farsi (the FHT database [5]) and Arabic [6]. PBOK (Persian, Bangla, Oriya and Kannada) is a multilingual handwritten database developed for four scripts [7]. Compared with those languages, far less attention has been given to Urdu. In the literature, only two handwritten databases exist for Urdu so far. First, CENPARMI [8] is an Urdu offline handwriting database, which incorporates the isolated digits, numer

*For correspondence