Active Learning to Remove Source Instances for Domain Adaptation for Word Sense Disambiguation



Abstract. This paper presents an active learning method for domain adaptation in word sense disambiguation. In general, active learning selects data with a high learning effect from an unlabeled data set, has it labeled manually, and adds it to the training data. In domain adaptation, however, some data in the source domain can degrade classification precision (misleading data), and the resulting errors carry over into the adapted classifier. Whenever data labeled by active learning is added to the training data, the proposed method attempts to detect misleading data in the source domain and delete it from the training data. In this way, classification precision is improved compared to standard learning.

Keywords: Active learning · Domain adaptation · Word sense disambiguation

1 Introduction

When a natural language processing task is performed, the training and test data are usually drawn from the same domain. Sometimes, however, they come from different domains. Recent studies on domain adaptation tune a classifier trained on data from one domain (the source domain) so that it fits the test data of another domain (the target domain) [5,7,11]. When domain adaptation is difficult because target domain labels are lacking, active learning [8,10] and semi-supervised learning [1] are effective. In this paper, we use active learning for domain adaptation in Word Sense Disambiguation (WSD).

Generally, active learning gradually increases the precision of the classifier by selecting data with a high learning effect from an unlabeled data set, labeling the data, and adding it to the training data, so that the amount of training data increases monotonically. In domain adaptation, however, the source domain training data contains data that have a negative influence on classification in the target domain. Here we refer to such data as “misleading data” [3]. In this paper, we detect such data in the source domain training data and delete it, using active learning to construct training data suitable for the target domain.

In the experiment, we use three domains from the Balanced Corpus of Contemporary Written Japanese (BCCWJ [4]): Yahoo! Answers (OC), Book (PB), and newspaper (PN). The data set, which is provided by the Japanese WSD task of SemEval-2 [6], has word sense tags attached to parts of these corpora. There are 16 multi-sense words with a certain frequency across all domains, and six patterns of domain adaptation (OC→PB, PB→PN, PN→OC, OC→PN, PN→PB, and PB→OC). We investigate domain adaptation for WSD using the proposed active learning method for 16 × 6 = 96 patterns and show the effectiveness of the proposed method.

© Springer Science+Business Media Singapore 2016. H. Shinnou et al., in K. Hasida and A. Purwarianti (Eds.): PACLING 2015, CCIS 593, pp. 97–107, 2016. DOI: 10.1007/978-981-10-0515-2_7
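The workflow outlined in the Introduction (query a target-domain instance, label it, add it to the training data, then prune misleading source instances) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The nearest-centroid classifier, the margin-based query rule, and especially the removal rule (a source instance is dropped when a classifier trained only on the labeled target data disagrees with its label) are all assumed for the example. The names `train`, `ranked`, `predict`, `margin`, `active_learning`, and `oracle` are hypothetical, and the toy data is invented.

```python
import math
from collections import Counter

def train(data):
    """Fit a nearest-centroid classifier; returns {label: centroid}."""
    sums, counts = {}, Counter()
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def ranked(model, x):
    """(distance, label) pairs ordered from nearest to farthest centroid."""
    return sorted(
        (math.sqrt(sum((c - v) ** 2 for c, v in zip(cen, x))), y)
        for y, cen in model.items()
    )

def predict(model, x):
    return ranked(model, x)[0][1]

def margin(model, x):
    """Gap between the two nearest centroids; a small gap means uncertain."""
    r = ranked(model, x)
    return r[1][0] - r[0][0] if len(r) > 1 else float("inf")

def active_learning(source, pool, oracle, rounds):
    source, pool, target = list(source), list(pool), []
    for _ in range(rounds):
        if not pool:
            break
        model = train(source + target)
        # Query step (assumed rule): pick the pool instance with the
        # smallest margin and ask the oracle (the human annotator) for its label.
        x = min(pool, key=lambda p: margin(model, p))
        pool.remove(x)
        target.append((x, oracle(x)))
        # Removal step (ASSUMPTION, not the paper's criterion): drop source
        # instances whose label contradicts a classifier trained only on the
        # target labels collected so far. Labels not yet seen in the target
        # data are left untouched.
        target_model = train(target)
        source = [
            (sx, sy) for sx, sy in source
            if sy not in target_model or predict(target_model, sx) == sy
        ]
    return train(source + target), source, target

# Toy data: in the target domain the sense boundary lies at 3, so the
# source instance ((2.0,), "b") is misleading.
source = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b"), ((4.0,), "b"), ((5.0,), "b")]
pool = [(0.5,), (1.5,), (2.5,), (3.5,), (4.5,)]
oracle = lambda x: "a" if x[0] < 3 else "b"

model, kept, labeled = active_learning(source, pool, oracle, rounds=3)
```

On this toy data, three queries are enough for the assumed rule to drop the source instance `((2.0,), "b")`, after which the final model assigns `(2.0,)` to sense `"a"`. A real WSD setting would replace the one-dimensional points with context feature vectors and the oracle with manual annotation.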

2 Active Learning with Deleted Misleading Data

2.1 Active Learning

Active learning is an approach that reduces the amount of manual labeling when building effective training data. Using a classifier trained on t