Context-dependent model for spam detection on social networks


Razan Ghanem¹ · Hasan Erbay²
Received: 2 November 2019 / Accepted: 19 August 2020
© Springer Nature Switzerland AG 2020

Abstract
Social media platforms are becoming an important communication medium in daily life, and their growing popularity makes them an ideal channel for spammers to spread spam messages. Moreover, messages on social media are vague and messy, so a good representation of the text may be the first step toward addressing the spam problem. Traditional weighting methods suffer from high dimensionality and high sparsity, while traditional word embedding methods suffer from context independence and out-of-vocabulary problems. To overcome these problems, we propose a novel architecture based on a context-dependent representation of text using the BERT model. The model was tested on the Twitter dataset, and experimental results show that the proposed method outperforms traditional weighting methods, traditional word embedding based methods, and the existing state-of-the-art methods used to detect spam on the Twitter platform.

Keywords Spam detection · Word embedding · Bidirectional encoder representations from transformers

1 Introduction
Social media are interactive computer-mediated technologies that facilitate the creation and sharing of information, ideas, career interests, and other forms of expression via virtual communities and networks. Twitter is one of the most popular social media platforms today. Twitter reported that its worldwide monetizable daily active users (mDAUs) grew by 24% to 166 million in Q1 2020. Each Twitter user has, on average, 208 followers, and users post 140 million tweets daily. This popularity has made Twitter a suitable environment for spreading spam messages, which have become a challenging problem because of the messiness and ambiguity of short text messages on social media. Social spam messages may be defined as irrelevant or unsolicited messages sent over social media, such as malicious links, advertisements, or any low-quality content. Unlike long messages such as e-mails, social spam messages are sparser and more ambiguous, which makes spam classification in social networks more challenging.

One of the important tasks that can be used to handle short text on social media is word representation. Traditional word representation methods are based on the Bag of Words (BoW) model, in which each word or n-gram is mapped to a vector index and marked as 0 or 1 depending on whether it occurs in a given document. Although this model produces acceptable results, it suffers from high dimensionality and high sparsity. Word embedding methods solve these problems by representing words as dense vectors, where each vector is the projection of a word into a continuous vector space. Word2vec, introduced by Tomas Mikolov and colleagues at Google in 2013, is one of the first widely used word embedding models. There are two main training algorithms f