Systematic Homonym Detection and Replacement Based on Contextual Word Embedding
Younghoon Lee
Accepted: 9 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Homonyms are words that share a spelling but differ in meaning, and they are a common feature of most languages. They are a source of noise in most text analyses and are difficult to detect, and numerous studies have addressed the problem. However, extant methods typically detect homonyms using a rule-based or statistical approach, which requires an answer set and pays little regard to the semantic meaning of the word. Therefore, we propose a novel approach for the detection of homonyms based on contextual word embedding, which allows a word to be understood based on the context in which it appears. In this study, we extracted all contextual word embedding vectors of individual words and clustered those vectors using spherical k-means clustering to detect pairs of homonyms. In addition, we developed a homonym replacement method to increase the performance of a document embedding technique based on word vector values. Using the proposed homonym detection method, we replaced the embedding vectors of homonyms with a representative vector for the respective meaning. Experimental results indicate that the proposed method effectively detects homonyms and significantly improves the performance of document embedding.
Keywords Homonym detection · Contextual word embedding · Word-clustering based document embedding · Spherical k-means clustering · ELMo
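The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the vectors below are synthetic stand-ins for contextual embeddings of a single spelling (no ELMo model is loaded), the choice of k = 2, the farthest-first initialisation, and the use of the cluster centroid as the "representative vector" are all assumptions made for illustration, and the names `spherical_kmeans`, `sense_a`, and `sense_b` are hypothetical.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50):
    """Cluster the rows of X by cosine similarity (spherical k-means)."""
    # Project every vector onto the unit sphere, so a dot product
    # between two rows equals their cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Assumption: deterministic farthest-first initialisation
    # (chosen for reproducibility, not taken from the paper).
    seeds = [X[0]]
    for _ in range(1, k):
        best_sim = np.max(X @ np.array(seeds).T, axis=1)
        seeds.append(X[np.argmin(best_sim)])
    centroids = np.array(seeds)

    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update each centroid to the re-normalised sum of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                centroids[j] = total / np.linalg.norm(total)
    return labels, centroids


# Synthetic stand-ins for the contextual vectors of one spelling, e.g. "book":
# 20 occurrences scattered around each of two hypothetical sense directions.
rng = np.random.default_rng(0)
dim = 16
sense_a = np.eye(dim)[0]
sense_b = np.eye(dim)[1]
vectors = np.vstack([
    sense_a + 0.05 * rng.normal(size=(20, dim)),
    sense_b + 0.05 * rng.normal(size=(20, dim)),
])

labels, centroids = spherical_kmeans(vectors, k=2)
print("cluster sizes:", np.bincount(labels, minlength=2))
print("cosine similarity between centroids:", float(centroids[0] @ centroids[1]))

# Assumption: use each cluster centroid as the representative vector of its sense,
# so every occurrence is replaced by the centroid of the cluster it belongs to.
replaced = centroids[labels]
```

Spherical k-means differs from ordinary k-means only in that vectors and centroids are kept at unit length, so assignment depends on direction (cosine similarity) rather than Euclidean distance. How the paper decides that a word is a homonym from the resulting clusters is not stated in this excerpt, so the sketch stops at reporting cluster sizes and centroid similarity.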
1 Introduction
In recent years, the generation of electronic information has grown tremendously [10,11,55,56]. As a result, discovering meaningful information in massive amounts of data has become a significant challenge [2,46,54,57]. Text mining is an interdisciplinary field that utilizes text data and draws on information retrieval, data mining, machine learning, statistics, and computational linguistics [3,37,38,40].
Younghoon Lee, Department of Industrial Engineering, Seoul National University of Science and Technology, 232, Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea
However, several technical challenges, including homonyms, can severely hinder the feasibility and applicability of systematic reviews. Homonyms are words that share their spellings but differ in meaning [31] and are a common feature of most languages [9,13]. A simple example is the word “pen,” which can mean “an enclosure for animals” as well as “a writing tool.” Another example is the term “book,” which can mean “sheets of text bound together” or “make a reservation.” Even in Word2vec [24] and GloVe [27], which are the most widely used word embedding methods, the problem associated with homonyms has been identified as a weakness. In Word2vec or GloVe, all words with the same spelling are embedded as the same vector even if they have different meanings, such as book (meaning published text) and book (meaning make a reserv