Systematic Homonym Detection and Replacement Based on Contextual Word Embedding


Younghoon Lee

Accepted: 9 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Homonyms are words that share their spelling but differ in meaning, and they are a common feature of most languages. Homonyms are a source of noise in most text analyses and are difficult to detect; numerous studies have been conducted in this regard. However, extant methods typically detect homonyms using rule-based or statistical approaches, which require an answer set and pay little regard to the semantic meaning of the word. Therefore, we propose a novel approach for the detection of homonyms based on contextual word embedding, which allows a word to be understood based on the context in which it appears. In this study, we extracted all contextual word embedding vectors of individual words and clustered those vectors using spherical k-means clustering to detect pairs of homonyms. In addition, we developed a homonym replacement method to increase the performance of a document embedding technique that is based on word vector values. Using the proposed homonym detection method, we replaced the embedding vectors of homonyms with a representative vector for each respective meaning. Experimental results indicate that the proposed method effectively detects homonyms and significantly improves the performance of document embedding.

Keywords Homonym detection · Contextual word embedding · Word-clustering based document embedding · Spherical k-means clustering · ELMo
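The detection step summarized in the abstract (collect the contextual vectors of every occurrence of a word, then cluster them with spherical k-means) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration rather than the paper's implementation: the occurrence vectors are synthetic stand-ins for contextual embeddings, `fake_occurrences` and `looks_like_homonym` are hypothetical helpers, and the decision rule (two centroids whose cosine similarity falls below a hypothetical `threshold`) is only one plausible way to turn the clusters into a homonym flag.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50):
    """Cluster rows of X by cosine similarity. Returns (labels, centroids)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto unit sphere
    # Deterministic farthest-first initialisation (an illustrative choice).
    centroids = [X[0]]
    for _ in range(1, k):
        best_sim = np.max(X @ np.array(centroids).T, axis=1)
        centroids.append(X[np.argmin(best_sim)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)  # nearest centroid by cosine
        updated = centroids.copy()
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old centroid if a cluster is empty
                updated[j] = total / norm  # re-normalise the mean direction
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids


def looks_like_homonym(X, threshold=0.5):
    """Hypothetical rule: flag a word whose two sense centroids point apart."""
    _, c = spherical_kmeans(X, k=2)
    return float(c[0] @ c[1]) < threshold


def fake_occurrences(sense_axes, n_per_sense=20, dim=8, noise=0.05, seed=0):
    """Synthetic stand-in for contextual vectors: one noisy blob per sense."""
    rng = np.random.default_rng(seed)
    rows = []
    for axis in sense_axes:
        base = np.zeros(dim)
        base[axis] = 1.0
        rows.append(base + noise * rng.normal(size=(n_per_sense, dim)))
    return np.vstack(rows)


book = fake_occurrences([0, 1])  # two senses, e.g. "published text" / "reserve"
table = fake_occurrences([0])    # a single sense
print(looks_like_homonym(book), looks_like_homonym(table))
```

In practice the rows of `X` would come from a contextual encoder such as ELMo, one row per occurrence of the word in the corpus. The number of clusters and the threshold would need to be chosen empirically.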

1 Introduction

In recent years, tremendous growth has been observed in the generation of electronic information [10,11,55,56]. As a result, discovering meaningful information from massive amounts of data is a significant challenge [2,46,54,57]. Text mining is an interdisciplinary field that utilizes text data and draws on information retrieval, data mining, machine learning, statistics, and computational linguistics [3,37,38,40].

Younghoon Lee, Department of Industrial Engineering, Seoul National University of Science and Technology, 232, Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea

However, several technical challenges, including homonyms, can severely hinder the feasibility and applicability of systematic reviews. Homonyms are words that share their spellings but differ in meaning [31] and are a common feature of most languages [9,13]. A simple example is the word “pen,” which can mean “an enclosure for animals” as well as “a writing tool.” Another example is the term “book,” which can mean “sheets of text bound together” or “make a reservation.” Even in Word2vec [24] and GloVe [27], which are the most widely used word embedding methods, the problem associated with homonyms has been identified as a weakness. In Word2vec or GloVe, all words with the same spelling are mapped to the same vector even if they have different meanings, such as book (meaning published text) and book (meaning make a reserv
