A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter


Usman Naseem¹ · Imran Razzak² · Peter W. Eklund²

Received: 28 April 2020 / Revised: 17 August 2020 / Accepted: 13 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Pre-processing plays an essential role in disambiguating the meaning of short-texts, not only in applications that classify short-texts but also for clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature in comparison to feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing to apply different pre-processing techniques in order to retain features without information loss. In this paper, two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques impact negatively on performance, and these are identified, along with the best performing combination of pre-processing techniques.

Keywords Natural language processing · Text pre-processing · Tweet classification · Machine learning

Affiliations: ¹ University of Sydney, Sydney, Australia; ² Deakin University, Geelong, Australia

Journal: Multimedia Tools and Applications

1 Introduction

Social media platforms play a more important role in global events than ever before. Analysis of information shared on social media platforms, especially Twitter, has become a significant focus for researchers in recent years. Millions of Twitter users share their opinions and views on various topics: political debate, the stock market, products, companies and so on. These opinions and views can be used to improve services, develop marketing strategies, observe user behaviours, anticipate emerging trends and even identify important events [33]. Aberrant behaviour also needs to be tracked, monitored and eliminated, and in this paper the classification of “hate speech”¹ is the focus of attention. Twitter messages are restricted to 140 characters, so the language used on Twitter adapts to this limitation: it is unstructured and at times very informal. Although many different pre-processing techniques have been applied to text classification tasks, the impact of pre-processing techniques alone, of different combinations of pre-processors, and of the sequence in which they are applied has not been systematically studied. In this article, a study of different pre-processing techniques
