Constructing Linguistic Resources for the Tunisian Dialect Using Textual User-Generated Contents on the Social Web


1 Université de Tunis, ENSIT, 1008 Montfleury, Tunisia
2 Université de Tunis, ISGT, LR99ES04 BESTMOD, 2000 Le Bardo, Tunisia

Abstract. In Arab countries, dialects are steadily gaining ground in social interaction on the web and adapting swiftly to globalization. As they strengthen their speakers' ties with the outside world and facilitate social exchanges, dialects acquire new transcriptions every day, which arouses the curiosity of researchers in the NLP community. In this article, we focus specifically on processing the Tunisian dialect. Our goal is to build corpora and dictionaries that allow us to begin studying this language and to identify its specificities. As a first step, we extract textual user-generated content from the social web. We then filter and classify this content automatically, keeping only the texts that contain Tunisian dialect. Finally, we present some salient features of the dialect observed in the resulting corpora.

Keywords: Tunisian dialect · Language identification · Corpus construction · Dictionary construction · Social web textual contents
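The filtering step outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes that comments are first split by writing script (Arabic or Latin) and then kept only if they contain at least one word from a small seed lexicon. The lexicon `TD_MARKERS`, the function names, and the threshold `min_hits` are all hypothetical and introduced here for illustration only.

```python
import re

# Hypothetical seed lexicon of Tunisian-dialect words in both scripts.
# Illustrative only; the paper's actual dictionaries are not reproduced here.
TD_MARKERS = {"برشا", "باهي", "barcha", "behi", "3lech"}

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")   # basic Arabic letters
LATIN_LETTER = re.compile(r"[A-Za-z]")
TOKEN = re.compile(r"[\u0621-\u064AA-Za-z0-9]+")  # digits kept for forms like "3lech"


def script_of(text):
    """Return 'arabic', 'latin', or 'other' according to the majority of letters."""
    arabic = len(ARABIC_LETTER.findall(text))
    latin = len(LATIN_LETTER.findall(text))
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic >= latin else "latin"


def looks_tunisian(text, markers=TD_MARKERS, min_hits=1):
    """Assumed heuristic: keep a text if it has at least min_hits lexicon words."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    hits = sum(1 for t in tokens if t in markers)
    return hits >= min_hits


def filter_comments(comments):
    """Group the comments that pass the lexicon test by writing script."""
    kept = {"arabic": [], "latin": []}
    for comment in comments:
        script = script_of(comment)
        if script != "other" and looks_tunisian(comment):
            kept[script].append(comment)
    return kept
```

A lexicon lookup of this kind is only a rough first pass. It cannot separate Tunisian words from identical forms in MSA or in other dialects, so a real pipeline would need a stronger classifier on top of it.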

1 Introduction

The Arabic language is characterized by its plurality. It comprises a wide variety of languages, including Modern Standard Arabic (MSA) and a set of dialects that differ across regions and countries. MSA is a standardized written form of Arabic and the official language of Arab countries. It is the form generally used in the press, the media, and official documents, and it is the form taught in schools. Dialects are regional variations and are the languages naturally spoken by Arab populations. They are largely influenced by the local historical and cultural specificities of the Arab countries [1]. They can differ greatly from one another and also show significant dissimilarities with MSA.

While many efforts have been devoted to the automatic processing of MSA over the last two decades, interest in processing dialects is quite recent and related work is relatively scarce. Most Arabic dialects are today under-resourced languages, and some have no resources at all. Our work contributes to the automatic processing of the Tunisian dialect (TD). TD faces a major difficulty: the almost total absence of resources (corpora and lexica) needed to develop TD processing tools such as morphological analyzers, POS taggers, and information extraction tools. As Arabic materials are written essentially in MSA, we propose in this work to exploit the informal textual content generated by Tunisian users on the Internet, particularly their exchanges on social networks, to harvest texts in TD and build TD language resources. Indeed, social exchanges have undergone a swift evol

J. Younes et al. © Springer International Publishing Switzerland 2015. F. Daniel and O. Diaz (Eds.): ICWE 2015 Workshops, LNCS 9396, pp. 3–14, 2015. DOI: 10.1007/978-3-319-24800-4_1
