CatDetect, a framework for detecting Catalan tweets


Sergi Plaza¹ · Jordi Vilaplana² · Jordi Mateo² · Josep Rius³ · Francesc Solsona²

Received: 20 April 2020 / Revised: 11 September 2020 / Accepted: 11 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

This work addresses language detection, a field whose proposals range from lexical and morphological analysis to an increasing use of machine learning solutions. The study focuses on Catalan, a minority language, in the context of the Twitter social network, where the nature of tweets (messages written on Twitter) makes detection more difficult. To this end, a Catalan Twitter corpus was generated using lexical and morphological approaches and then used to create supervised models based on machine learning techniques. The models were evaluated to determine which obtains the best prediction score and is therefore the most suitable to use. We show that our proposal is successful with Twitter in the case of minority languages. The best model is to be used on a website, where users can test the algorithm interactively through the front-end web page and in the background through a web service exposed as a RESTful API.

Keywords Catalan · Language detection · Twitter corpus · Machine learning

¹ GFT, Parc Científic i Tecnològic de Lleida, Building H1, Lleida, 25003, Spain

² Department of Computer Science, University of Lleida, Jaume II 69, Lleida, 25001, Spain

³ Department of AEGERN, University of Lleida, Jaume II 73, Lleida, 25001, Spain

Multimedia Tools and Applications

1 Introduction

Language detection is the natural language processing task of determining the language of a given sentence, paragraph or text [3]. Part of the difficulty lies in the fact that there are more than six thousand different languages [26], and many different approaches have been proposed to tackle the problem. Although language detection has been widely researched, Catalan, as a minority language, has been little studied [5]. At the time of writing, even Twitter does not tag Catalan tweets, and it lacks APIs supporting Catalan detection.

Although the language identification problem is generally considered to be solved [19], there are particular contexts that considerably hinder the task [8]. Tweets are characterized by very short texts, a high level of noise (hashtags, mentions, URLs, emoticons, emojis, etc.), a mixture of languages, slang expressions and heavily modified spelling. All these factors make their classification difficult [17]. It is therefore necessary to design strategies specifically conceived to deal with the particular characteristics of Twitter in order to solve the language identification problem. In 2012, Bergsma et al. [2] raised an important cr
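To make the noise sources listed in the Introduction concrete, the following is a minimal sketch of how hashtags, mentions, URLs and emojis might be stripped from a tweet before language identification. It is an illustration only and not the preprocessing used by CatDetect, which this excerpt does not describe. The function name `strip_tweet_noise`, the regular expressions and the emoji ranges are all assumptions chosen for the example.

```python
import re

# Hypothetical patterns for illustration; not taken from the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough, non-exhaustive emoji ranges (pictographs and miscellaneous symbols).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WHITESPACE_RE = re.compile(r"\s+")


def strip_tweet_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags and emojis, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        # Replace with a space so neighbouring words are not glued together.
        text = pattern.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    tweet = "Bon dia @joan mira això https://t.co/abc #Lleida 😀"
    print(strip_tweet_noise(tweet))  # prints: Bon dia mira això
```

Two design choices in this sketch are debatable. Dropping hashtags entirely discards words that may carry a language signal, so keeping the hashtag text without the `#` is a reasonable alternative. Text emoticons such as `:)` are not handled here and would need their own pattern.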
