An integrated semi-automated framework for domain-based polarity words extraction from an unannotated non-English corpus
Mohammed Kaity¹ · Vimala Balakrishnan¹
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Building sentiment analysis resources is a fundamental step before developing any sentiment analysis model. Sentiment lexicons are one of these critical resources. However, many non-English languages suffer from a severe shortage of these resources and lexicons. This study proposes an integrated framework for extracting domain-based polarity words from a massive unannotated non-English corpus. The framework consists of three layers, namely lexicon-based, corpus-based and human-based. The first two layers automatically recognize and extract new polarity words from a massive unannotated corpus using initial seed lexicons. A key advantage of the proposed framework is that it only needs an initial seed lexicon and an unannotated corpus to start the extraction process. Therefore, the framework is semi-automated due to the use of seed lexicons. Experiments on three languages indicate that the proposed framework outperformed existing lexicons, achieving F-scores of 77.8%, 83.8% and 68.6% for the Arabic, French and Malay lexicons, respectively.

Keywords Multilingual sentiment analysis · Sentiment lexicon · Polarity words · Social media analysis · Unannotated corpus
* Vimala Balakrishnan [email protected]

¹ Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

1 Introduction

Over the past two decades, sentiment analysis of social media data has received increasing interest. The primary aim of sentiment analysis is to extract the opinions embedded in given data, such as opinions on products, services, news, and social and political events [1, 2]. Many techniques have been developed to classify opinions, one of which is the sentiment lexicon-based approach. A sentiment lexicon is described as a list of opinion or opinionated words and phrases with their sentiment categories or orientations [3–5]. The sentiment orientations indicate the polarity and strength of the words and phrases in the sentiment lexicon (e.g. positive, negative, 1, −1). Sentiment lexicons can be employed for lexicon-based classification to calculate text polarity (i.e. positive, negative, or neutral) by aggregating the orientation values of the polarity or sentiment words in context [3, 6]. Furthermore, sentiment lexicons have been shown to be extremely helpful when used to extract features for machine learning algorithms [5, 7].

Although several researchers have studied the problem of building and expanding sentiment lexicons, many limitations remain unresolved. For example, the majority of those studies focused on English-based sentiment lexicons, while for many other languages, such as Arabic, French and Malay, these lexicons are either limited or not available. More oft
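The lexicon-based classification described above can be illustrated with a minimal sketch. It assumes a toy English lexicon (`TOY_LEXICON`, hypothetical) that maps words to orientation values of 1 or −1, and it simply sums the values of the lexicon words found in a text. The function names and the tokenization rule are illustrative assumptions, not the method used in this study.

```python
import re
from typing import Dict, List

# Hypothetical toy lexicon: word -> orientation (+1 positive, -1 negative).
TOY_LEXICON: Dict[str, int] = {
    "good": 1,
    "excellent": 1,
    "bad": -1,
    "terrible": -1,
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def polarity_score(text: str, lexicon: Dict[str, int]) -> int:
    """Sum the orientation values of all lexicon words found in the text."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def classify(text: str, lexicon: Dict[str, int] = TOY_LEXICON) -> str:
    """Map the summed score to positive, negative, or neutral."""
    score = polarity_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("The food was good and the service was excellent."))
    print(classify("A terrible, bad experience with one good moment."))
    print(classify("The table is brown."))
```

A plain sum ignores negation, intensifiers and domain-specific meanings, which is one reason domain-based lexicons such as those targeted by the proposed framework matter.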