A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter
Usman Naseem¹ · Imran Razzak² · Peter W. Eklund²
Received: 28 April 2020 / Revised: 17 August 2020 / Accepted: 13 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Pre-processing plays an essential role in disambiguating the meaning of short-texts, not only in applications that classify short-texts but also for clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature than feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing that applies different pre-processing techniques in order to retain features without information loss. Two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques have a negative impact on performance; these are identified, along with the best-performing combination of pre-processing techniques.
Keywords Natural language processing · Text pre-processing · Tweet classification · Machine learning
¹ University of Sydney, Sydney, Australia
² Deakin University, Geelong, Australia
Multimedia Tools and Applications
1 Introduction
Social media platforms play a more important role in global events than ever before. Analysis of information shared on social media platforms, especially Twitter, has become a significant focus for researchers in recent years. Millions of Twitter users share their opinions and views on various topics: political debate, the stock market, products, companies and so on. These opinions and views can be used to improve services, develop marketing strategies, observe user behaviours, anticipate emerging trends and even identify important events [33]. Aberrant behaviour also needs to be tracked, monitored and eliminated, and in this paper the classification of “hate speech” is the focus of attention. Twitter messages are restricted to 140 characters, so the language used on Twitter adapts to this limitation: it is unstructured and at times very informal. Although many different pre-processing techniques have been applied to text classification tasks, the impact of pre-processing techniques alone, of different combinations of pre-processors, and of the sequence in which they are applied has not been systematically studied. In this article, a study of different pre-processing techniques
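
The point about the sequence of pre-processors can be illustrated with a minimal Python sketch. It is not the pipeline proposed in the paper: the four steps, the function names (`remove_urls`, `remove_punctuation`, `lowercase`, `collapse_whitespace`, `apply_pipeline`) and the sample tweet are all assumptions chosen for illustration. It applies the same steps in two different orders and shows that the cleaned text differs.

```python
import re
from typing import Callable, List

# A pre-processing step maps a string to a string (illustrative type alias).
Step = Callable[[str], str]


def remove_urls(text: str) -> str:
    """Drop http(s) links; relies on the '://' punctuation still being present."""
    return re.sub(r"https?://\S+", "", text)


def remove_punctuation(text: str) -> str:
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def lowercase(text: str) -> str:
    """Convert the text to lower case."""
    return text.lower()


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


def apply_pipeline(text: str, steps: List[Step]) -> str:
    """Apply the steps one after another, in the order given."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical tweet, made up for this example.
tweet = "Check THIS out!!! https://t.co/abc123 #NotOkay"

# Order A: strip URLs before punctuation.
urls_first = apply_pipeline(
    tweet, [remove_urls, remove_punctuation, lowercase, collapse_whitespace]
)

# Order B: strip punctuation first, which destroys the '://' that the
# URL pattern needs, so the link survives as a meaningless token.
punct_first = apply_pipeline(
    tweet, [remove_punctuation, remove_urls, lowercase, collapse_whitespace]
)

print(urls_first)   # check this out notokay
print(punct_first)  # check this out httpstcoabc123 notokay
```

With the same four steps, the second ordering leaves a spurious token (`httpstcoabc123`) that a word-level feature extractor would treat as a rare vocabulary item. This is one simple way in which the order of pre-processing, and not only the choice of techniques, can change what a classifier sees.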