CatDetect, a framework for detecting Catalan tweets
Sergi Plaza¹ · Jordi Vilaplana² · Jordi Mateo² · Josep Rius³ · Francesc Solsona²
Received: 20 April 2020 / Revised: 11 September 2020 / Accepted: 11 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
This work deals with language detection, a field that includes new proposals ranging from lexical and morphological analysis to an increasing use of machine learning solutions. The study focuses on Catalan, a minority language, in the context of the Twitter social network, a setting that makes it harder to detect the language of tweets (messages written on Twitter). To this end, a Catalan Twitter corpus was generated using lexical and morphological approaches and then used to create supervised models based on machine learning techniques. The models were evaluated to determine which obtains the best prediction score and is therefore the most suitable. We demonstrate that our proposal is successful on Twitter in the case of minority languages. The best model is to be used on a website, where users can test the algorithm interactively through a front-end webpage and, in the background, through a web service exposed via a RESTful API.
Keywords Catalan · Language detection · Twitter corpus · Machine learning
¹ GFT, Parc Científic i Tecnològic de Lleida, Building H1, Lleida, 25003, Spain
² Department of Computer Science, University of Lleida, Jaume II 69, Lleida, 25001, Spain
³ Department of AEGERN, University of Lleida, Jaume II 73, Lleida, 25001, Spain
Multimedia Tools and Applications
1 Introduction
Language detection is the problem of processing natural language in order to determine the language of a given sentence, paragraph or text [3]. The difficulty of natural language processing lies in the fact that there are more than six thousand different languages [26]. Many different approaches have been proposed to achieve this aim. Catalan is a minority language, so although language detection has been widely researched, Catalan has been little studied [5]. At present, even Twitter does not tag Catalan tweets, and it lacks APIs supporting Catalan detection. Although the language identification problem is generally considered to be solved [19], there are particular contexts that considerably hinder this task [8]. Tweets are characterized by very short texts, a high level of noise (hashtags, mentions, URLs, emoticons, emojis, etc.), a mixture of languages, slang expressions and heavily modified spelling. All these factors make their classification difficult [17]. It is therefore necessary to design strategies specifically conceived to deal with the particular characteristics of Twitter in order to solve the language identification problem. In 2012, Bergsma et al. [2] raised an important cr