Various Pre-processing Strategies for Domain-Based Sentiment Analysis of Unbalanced Large-Scale Reviews


Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
Faculty of Applied Sciences, Department of Computer Science, Taiz University, Taizz, Yemen

Abstract. User reviews are important resources for many processes, such as recommender systems and decision-making programs. Sentiment analysis is one of the processes that is very useful for extracting valuable information from these reviews. The data preprocessing step is important in the sentiment analysis process, and suitable preprocessing methods are necessary. Most of the available research on the effect of preprocessing methods focuses on balanced, small-sized datasets. In this research, we apply different preprocessing methods for building a domain lexicon from unbalanced, big-sized reviews. The applied preprocessing methods study the effects of stopwords, negation words and the number of word occurrences. We then apply different preprocessing methods to determine the words that have high sentiment orientations when calculating the total review sentiment score. Two main experiments with five cases are tested on the Amazon dataset for the movie domain. The most suitable preprocessing method is then selected for building the domain lexicon as well as for calculating the total review sentiment score using the generated lexicon. Finally, we evaluate the proposed lexicon by comparing it with a general lexicon. The proposed lexicon outperforms the general lexicon in calculating the total review sentiment score in terms of accuracy and F1-measure. Furthermore, the results show that sentiment words are not restricted to adjectives and adverbs only (as commonly claimed); nouns and verbs also contribute to the sentiment score and thus affect the sentiment analysis process. Moreover, the results also show that negation words have positive effects in the sentiment analysis process.

Keywords: User reviews · Sentiment analysis · Data preprocessing methods · Domain-based lexicon · Unbalanced dataset · Sentiment words
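To make the pipeline summarized in the abstract concrete, the following is a minimal sketch of the general idea: filter stopwords while keeping negation words, build a word-level lexicon from labeled reviews subject to a minimum occurrence count, and sum word scores into a total review score. It is an illustration under stated assumptions, not the authors' method: the stopword and negation lists, the scoring formula `(pos - neg) / (pos + neg)`, the rule that a negation flips the next scored word, and all names (`tokenize`, `build_lexicon`, `review_score`, `min_count`) are hypothetical. The sketch also makes no correction for class imbalance.

```python
from collections import Counter

# Hypothetical, tiny word lists (assumptions for illustration only).
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "of"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text, drop_stopwords=True):
    """Lowercase, split on whitespace, strip surrounding punctuation.

    Stopwords are optionally dropped; negation words are always kept.
    """
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    return [
        t for t in tokens
        if t and (t in NEGATIONS or not drop_stopwords or t not in STOPWORDS)
    ]


def build_lexicon(labeled_reviews, min_count=2):
    """Score each word in [-1, 1] from its counts in positive and negative reviews.

    Words occurring fewer than `min_count` times in total are discarded.
    """
    pos, neg = Counter(), Counter()
    for text, label in labeled_reviews:
        (pos if label == "pos" else neg).update(tokenize(text))
    lexicon = {}
    for word in set(pos) | set(neg):
        total = pos[word] + neg[word]
        if total >= min_count:
            lexicon[word] = (pos[word] - neg[word]) / total
    return lexicon


def review_score(text, lexicon):
    """Sum word scores; a negation word flips the sign of the next scored word."""
    score, flip = 0.0, False
    for token in tokenize(text):
        if token in NEGATIONS:
            flip = True
            continue
        if token in lexicon:
            score += -lexicon[token] if flip else lexicon[token]
            flip = False
    return score


# Toy data, invented for this example.
reviews = [
    ("Great movie, great cast", "pos"),
    ("A great story", "pos"),
    ("Boring plot and boring acting", "neg"),
    ("Boring movie", "neg"),
]
lexicon = build_lexicon(reviews, min_count=2)
print(review_score("This movie is not boring", lexicon))
```

In this toy setting, raising `min_count` removes rare words from the lexicon, and passing `drop_stopwords=False` to `tokenize` keeps stopwords as candidates, which mirrors the kinds of preprocessing choices the abstract says are compared.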

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. E. Hassanien et al. (Eds.): AISI 2020, AISC 1261, pp. 204–214, 2021. https://doi.org/10.1007/978-3-030-58669-0_19

1 Introduction

Millions of people regularly share their opinions on goods, services and deals through online channels such as social networks, forums, wikis, and discussion boards. These reviews reflect the users' experiences with the consumed services and are of significant importance to users, vendors, and companies. The reviews are complicated in nature: they are short, unstructured and sensitive to noise, since they are written by regular, non-professional users [1]. To benefit from these reviews, many fields are involved in processing them, such as sentiment analysis. Sentiment Analysis (SA) is used to extract the feeling or opin