A multi-platform dataset for detecting cyberbullying in social media



David Van Bruwaene¹ · Qianjia Huang² · Diana Inkpen²



© Springer Nature B.V. 2020

Abstract Recent work on cyberbullying detection relies on using machine learning models with text and metadata in small datasets, mostly drawn from single social media platforms. Such models have succeeded in predicting cyberbullying when dealing with posts containing the text and the metadata structure as found on the platform. Instead, we develop a multi-platform dataset that consists purely of the text from posts gathered from seven social media platforms. We present a multistage and multi-technique annotation system that initially uses crowdsourcing for post and hashtag annotation and subsequently utilizes machine-learning methods to identify additional posts for annotation. This process has the benefit of selecting posts for annotation that have a significantly greater than chance likelihood of constituting clear cases of cyberbullying without limiting the range of samples to those containing predetermined features (as is the case when hashtags alone are used to select posts for annotation). We show that, despite the diversity of examples present in the dataset, good performance is possible for models trained on datasets produced in this manner. This becomes a clear advantage compared to traditional methods of post selection and labeling because it increases the number of positive examples that can be produced using the same resources and it enhances the diversity of communication media to which the models can be applied.

Keywords Cyberbullying · Bullying · Cyberaggression · Dataset · Social media · Machine learning · Deep learning · Natural language processing

This project was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Centers of Excellence (OCE), and SafeToNet Ltd.

Corresponding author: David Van Bruwaene [email protected]
Qianjia Huang [email protected]
Diana Inkpen [email protected]

¹ SafeToNet Ltd., 51 Breithaupt Street, Suite 100, Kitchener, ON N2H 5G5, Canada

² School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada

1 Introduction

Cyberbullying, which can be defined as ‘an aggressive, intentional act carried out by a group or individual, using electronic forms of contact, repeatedly and over time against a victim who cannot easily defend him or herself’ (Smith et al. 2008), has become a pernicious social problem in recent years. According to the Cyber Bullying Research Center,¹ about half of American teenagers have experienced cyberbullying, and 10 to 20 percent are involved in repeated cyberbullying events. This is especially worrying, as multiple studies found that cyberbullying victims often have psychiatric and psychosomatic disorders (Beckman et al. 2012), and a British study found that nearly half of suicides among young people were related to bullying (BBC News).² These factors underscore an urgent need to understand