A novel Tag Score (T_S) model with improved K-means for clustering tweets
Sådhanå (2020)45:125 https://doi.org/10.1007/s12046-020-01359-5
S POOMAGAL*, B MALAR, J INAMUL HASSAN and R KISHOR
Department of Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, India
e-mail: [email protected]; [email protected]; [email protected]; [email protected]
MS received 17 July 2018; revised 23 January 2020; accepted 23 February 2020

Abstract. Clustering of tweets is useful for analyzing the attitudes of people towards a particular product. Companies can use this analysis to modify their products to meet people's needs. Recently, K-means clustering has been widely used to cluster tweets, with a bag of words as the feature set. The key factors contributing to cluster quality and clustering performance are dimensionality reduction and the initial selection of centroids. This paper addresses these issues using a newly proposed Tag Score (T_S) model with improved K-means, in which semantically similar features from the bag of words are grouped into tags, scores are modified based on sentiment polarity values, and the initial centroids are selected with the help of sentiment scores. The performance of the proposed T_S model with improved K-means is compared with that of the T_S model with random K-means and of conventional word vectors with random K-means on three labeled datasets and three unlabeled datasets. The results show that the proposed method produces significant results in approximately 70% of the cases in terms of purity, F-measure, intra-cluster distance and inter-cluster distance.

Keywords. Clustering; K-means; tweets; sentiment analysis; opinion mining; Sentiwordnet.
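
The abstract states that the initial centroids are selected with the help of sentiment scores, but it does not give the procedure. The sketch below illustrates one plausible reading only and should not be taken as the authors' algorithm: tweets are ordered by sentiment score, split into k equal-sized groups, and the mean vector of each group seeds a standard K-means (Lloyd) loop. The function names `seed_centroids`, `mean_vector` and `kmeans`, the equal-sized split, and the use of Euclidean distance are all assumptions made for illustration.

```python
from math import dist


def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def seed_centroids(vectors, scores, k):
    """Assumed seeding rule: order tweets by sentiment score, cut the
    ordering into k contiguous groups, and use each group's mean vector
    as an initial centroid. This is not taken from the paper."""
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of vectors")
    order = sorted(range(n), key=lambda i: scores[i])
    centroids = []
    for g in range(k):
        # Integer arithmetic keeps every group non-empty when k <= n.
        members = order[g * n // k:(g + 1) * n // k]
        centroids.append(mean_vector([vectors[i] for i in members]))
    return centroids


def kmeans(vectors, centroids, max_iter=100):
    """Standard Lloyd iterations starting from the supplied centroids."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids
```

Calling `kmeans(vectors, seed_centroids(vectors, scores, k))` replaces random initialization with a deterministic one, so repeated runs on the same data return the same clusters. Whether this seeding also improves purity or F-measure depends on how well the sentiment scores separate the data.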
1. Introduction

Twitter is a popular social media site on which people post tweets to express their likes and grievances about a particular policy or product. Companies can analyze these tweets to understand whether customers accept or reject their products. This analysis can then be used in decision making to improve their business. Only registered Twitter users can post tweets, and each tweet can contain at most 280 characters. In addition to the text content, a tweet has other parts such as hash tags, URLs and usernames. A hash tag contains keywords that indicate the topic of the tweet. The URL is used for analytical purposes. Natural language processing and machine learning play a major role in identifying the part of the text content that needs to be considered for analysis, and in finding the meaning of the words present in it (synsets) together with their polarity values (Sentiwordnet) as positive, negative or neutral. This classification is essential in many scenarios, for example to gather the public's opinion on government policies and to review those policies based on that feedback. Existing techniques in the literature extract only the words in tweets for clustering and they used Wordnet and Senti