Fake opinion detection: how similar are crowdsourced datasets to real data?
Tommaso Fornaciari¹ · Leticia Cagnina² · Paolo Rosso³ · Massimo Poesio⁴

¹ Bocconi University, Milan, Italy
² Universidad Nacional de San Luis, San Luis, Argentina
³ Universitat Politècnica de València, Valencia, Spain
⁴ Queen Mary University of London, London, UK

© Springer Nature B.V. 2020

Abstract Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because it is normally impossible to know whether a review is genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this way are generally successful at discriminating between ‘genuine’ online reviews and the crowdsourced deceptive reviews. It has been argued that the deceptive reviews obtained via crowdsourcing are very different from real fake reviews, but the claim has never been properly tested. In this paper, we compare (false) crowdsourced reviews with a set of ‘real’ fake reviews published online. We evaluate their degree of similarity and their usefulness in training models for the detection of untrustworthy reviews. We find that the deceptive reviews collected via crowdsourcing are significantly different from the fake reviews published online. In the case of the artificially produced deceptive texts, it turns out that their domain similarity with the targets affects the models’ performance much more than their untruthfulness. This suggests that the use of crowdsourced datasets for opinion spam detection may not result in models applicable to the real task of detecting deceptive reviews. As an alternative way to create large datasets for the fake review detection task, we propose methods based on the probabilistic annotation of unlabeled texts, relying on meta-information generally available on e-commerce sites. Such methods are independent of the content of the reviews and make it possible to train reliable models for the detection of fake reviews.

Keywords Deception detection · Crowdsourcing · Ground truth · Probabilistic labeling
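The probabilistic annotation mentioned above can be illustrated with a minimal sketch. The metadata fields used here (`verified_purchase`, `reviews_same_day`, `rating_deviation`) and the unweighted averaging rule are hypothetical stand-ins, not the method described in the paper; the point is only that content-independent meta-information can yield soft labels for otherwise unlabeled reviews.

```python
# Minimal sketch of probabilistic (soft) labeling from review metadata.
# The metadata fields and weighting rule below are illustrative assumptions,
# not the paper's actual method.

def soft_label(meta: dict) -> float:
    """Combine metadata cues into an estimated P(deceptive) for one review."""
    cues = [
        # Unverified purchases are treated as weak evidence of deception.
        0.7 if not meta["verified_purchase"] else 0.3,
        # Many reviews from the same account on one day look suspicious.
        min(1.0, meta["reviews_same_day"] / 5.0),
        # Large deviation from the product's mean rating (|diff| on a 0-4 scale).
        min(1.0, abs(meta["rating_deviation"]) / 4.0),
    ]
    # Naive unweighted average; a real system would calibrate these cues.
    return sum(cues) / len(cues)

if __name__ == "__main__":
    reviews = [
        {"verified_purchase": False, "reviews_same_day": 6, "rating_deviation": 3.5},
        {"verified_purchase": True, "reviews_same_day": 1, "rating_deviation": 0.5},
    ]
    # Prints soft labels (~0.86 and ~0.21) usable as training targets.
    print([round(soft_label(m), 2) for m in reviews])
```

Under such a scheme, the resulting soft labels can be used directly as training targets, so that no review ever needs to be hand-annotated as genuine or fake.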
1 Introduction

Many e-commerce sites, such as Amazon, eBay and Tripadvisor, give customers the opportunity to leave comments on their products. Shoppers appreciate the possibility of sharing their opinions, and often take advantage of other consumers’ experience. However, the lack of control over who is allowed to publish reviews exposes customers to the risk of encountering texts which do not express honest opinions, but are concealed forms of commercial promotion. Identifying such disguised advertisements is not trivial, and the scale of the phenomenon is difficult to estimate. Even so, there is a growi