Local dense mixed region cutting + global rebalancing: a method for imbalanced text sentiment classification
ORIGINAL ARTICLE
Yang Li1 · Jie Wang1 · Suge Wang1,2 · Jiye Liang1,2 · Juanzi Li3
Received: 15 April 2018 / Accepted: 26 July 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018
Abstract
Class imbalance is widespread in text sentiment classification data, and it poses a serious challenge for designing an effective classifier. In this paper, we propose a two-stage data balancing scheme for text sentiment classification, which not only makes the data boundary clear but also balances the class distribution of the training data set. The core algorithm of the scheme, LDMRC, is based on the shortest distance from a point to a straight line; it removes some majority class texts in the neighborhood of a minority class text to balance the class distribution in the local dense mixed region. The second stage employs SS or RS as a data rebalancing strategy to globally balance the training data set after local dense mixed region cutting. The two-stage scheme is applied as a preprocessing step ahead of a learning algorithm such as SVM. We verify the effectiveness of the proposed method using SVM on eight imbalanced data sets: Book_c, Hotel, Jadeite, and Insurance in Chinese, and DVD, Book_e, Electronics, and Kitchen in English. The experimental results show that LDMRC outperforms BRC, the best existing cutting algorithm, on Acc, RN and FN. Furthermore, LDMRC+SS and LDMRC+RS outperform LDMRC alone on the Chinese data sets. This indicates that local boundary cutting alone cannot achieve the best effect, and that data rebalancing strategies are necessary for text sentiment classification.
Keywords Imbalanced text sentiment classification · Dense mixed region cutting · Global data rebalancing · Resampling
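The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' LDMRC: the actual algorithm relies on the shortest distance from a point to a straight line, which this excerpt does not detail. Here, as an assumption, stage 1 uses a simple k-nearest-neighbour rule to drop surplus majority points around each minority point, and stage 2 uses random undersampling as a stand-in for the global rebalancing step. The function names `cut_local_mixed_regions`, `random_undersample`, and `two_stage_balance`, and the parameters `k` and `seed`, are hypothetical.

```python
import numpy as np

def cut_local_mixed_regions(X, y, minority_label, k=5):
    """Stage 1 (simplified stand-in, not LDMRC): around each minority point,
    remove the nearest majority neighbours until they no longer outnumber
    the minority points in that k-neighbourhood."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i in np.flatnonzero(y == minority_label):
        order = np.argsort(dists[i])
        # k nearest points that are still kept, excluding the point itself.
        neighbours = [j for j in order if j != i and keep[j]][:k]
        majority = [j for j in neighbours if y[j] != minority_label]
        minority_count = 1 + (len(neighbours) - len(majority))
        surplus = len(majority) - minority_count
        for j in majority[:max(surplus, 0)]:
            keep[j] = False  # drop the closest surplus majority points
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, seed=0):
    """Stage 2 (assumed strategy): randomly undersample the majority class
    so that both classes have the same number of examples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_keep = min(len(majority_idx), len(minority_idx))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[idx], y[idx]

def two_stage_balance(X, y, minority_label, k=5, seed=0):
    """Local cutting followed by global rebalancing."""
    X_cut, y_cut = cut_local_mixed_regions(X, y, minority_label, k=k)
    return random_undersample(X_cut, y_cut, minority_label, seed=seed)
```

The balanced output would then be passed to a classifier such as SVM. Minority examples are never removed in either stage of this sketch; only majority examples are discarded.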
* Suge Wang [email protected]
1 School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
2 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China
3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

1 Introduction
Text sentiment analysis has received much attention in recent years from researchers in machine learning, data mining, and especially natural language processing. As an important task in text sentiment analysis, text sentiment classification aims to automatically classify texts into one or more sentiment polarity categories by mining and analyzing the subjective information in texts, such as standpoints, opinions, and attitudes toward a subject. Research interest in text sentiment classification is motivated by its extensive practical applications. As is well known, with the widespread use of Web 2.0, massive amounts of user-generated text data have been scattered across BBS, blogs, forum websites, social media