Constructing Linguistic Resources for the Tunisian Dialect Using Textual User-Generated Contents on the Social Web
1 Université de Tunis, ENSIT, 1008 Montfleury, Tunisia [email protected], [email protected]
2 Université de Tunis, ISGT, LR99ES04 BESTMOD, 2000 Le Bardo, Tunisia [email protected]
Abstract. In Arab countries, the dialect is gaining ground daily in social interaction on the web and is adapting swiftly to globalization. By strengthening the relationship of its speakers with the outside world and facilitating their social exchanges, the dialect acquires new transcriptions every day, which arouse the curiosity of researchers in the NLP community. In this article, we focus specifically on processing the Tunisian dialect. Our goal is to build corpora and dictionaries that allow us to begin our study of this language and to identify its specificities. As a first step, we extract textual user-generated content from the social web; we then filter and classify this content automatically, keeping only the texts that contain Tunisian dialect. Finally, we present some of the dialect's salient features as observed in the resulting corpora.
Keywords: Tunisian dialect · Language identification · Corpus construction · Dictionary construction · Social web textual contents
1 Introduction
The Arabic language is characterized by its plurality. It comprises a wide variety of languages, including Modern Standard Arabic (MSA) and a set of dialects that differ across regions and countries. MSA is a standardized written form of Arabic and the official language of Arab countries. It is the written form generally used in the press, the media, and official documents, and it is the form taught in schools. Dialects are regional varieties that represent the languages naturally spoken by Arab populations. They are largely influenced by the local historical and cultural specificities of the Arab countries [1]. They can differ greatly from one another and also show significant dissimilarities with MSA. While many efforts have been made during the last two decades toward the automatic processing of MSA, interest in processing dialects is quite recent, and related work is relatively scarce. Most Arabic dialects are today under-resourced languages, and some are unresourced. Our work contributes to the automatic processing of the Tunisian dialect (TD). TD faces a major difficulty: the almost total absence of resources (corpora and lexica) needed to develop TD processing tools such as morphological analyzers, POS taggers, and information extraction tools. As Arabic materials are written essentially in MSA, we propose in this work to exploit informal textual content generated by Tunisian users on the Internet, particularly their exchanges on social networks, to harvest texts in TD and build TD language resources. Indeed, social exchanges have undergone a swift evol
© Springer International Publishing Switzerland 2015. F. Daniel and O. Diaz (Eds.): ICWE 2015 Workshops, LNCS 9396, pp. 3–14, 2015. DOI: 10.1007/978-3-319-24800-4_1