Resources and benchmark corpora for hate speech detection: a systematic review



Fabio Poletto1 • Valerio Basile1 • Manuela Sanguinetti1 • Cristina Bosco1 • Viviana Patti1



© The Author(s) 2020

Abstract Hate Speech in social media is a complex phenomenon whose detection has recently gained significant traction in the Natural Language Processing community, as attested by several recent review works. Annotated corpora and benchmarks are key resources, considering the vast number of supervised approaches that have been proposed. Lexica also play an important role in the development of hate speech detection systems. In this review, we systematically analyze the resources made available by the community at large, including their development methodology, topical focus, language coverage, and other factors. The results of our analysis highlight a heterogeneous, growing landscape, marked by several issues and avenues for improvement.

The work of F. Poletto is funded by Fondazione Giovanni Goria and Fondazione Cassa di Risparmio di Torino (Talenti della Società Civile 2018). The work of V. Basile, C. Bosco, V. Patti and M. Sanguinetti is partially funded by Progetto di Ateneo/Compagnia di San Paolo 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01).

1 University of Turin, Turin, Italy


Keywords Hate speech detection · Benchmark corpora · Natural Language Processing shared tasks · Systematic review

1 Introduction

Within the field of AI, and Natural Language Processing (NLP) in particular, techniques for tasks related to Sentiment Analysis and Opinion Mining (SA&OM) have grown in relevance over the past decades. Such techniques are typically motivated by purposes such as extracting users' opinions on a given product or polling political stance. Robust and effective approaches are made possible by the rapid progress in supervised learning technologies and by the huge amount of user-generated content available online, especially on social media.

More recently, the NLP community has witnessed a growing interest in tasks related to social and ethical issues, also encouraged by the global commitment to fighting extremism, violence, fake news and other plagues affecting the online environment. One such phenomenon is hate speech, a toxic discourse which stems from prejudice and intolerance and which can lead to episodes, and even structured policies, of violence, discrimination and persecution.

Hate Speech (HS), lying at the intersection of multiple tensions as an expression of conflicts between different groups within and across societies, is a phenomenon that can easily proliferate on social media. It is a vivid example of how technologies with a transformative potential are loaded with both opportunities and challenges. Implying a complex balance between freedom of expression and defense of human dignity, HS is hotly