Factors affecting sentence similarity and paraphrasing identification


Marwah Alian1,2 · Arafat Awajan2
Received: 24 February 2020 / Accepted: 3 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Sentence similarity determines whether two sentences are close in structure and meaning. The detection of sentence similarity can be affected by several factors, such as the sentence representation, the similarity measure, and the word weighting function. In this study, the impact of three factors that influence similarity detection and paraphrasing identification is evaluated using clustering algorithms. To evaluate these factors, we tried different word embedding models, clustering algorithms, and weighting methods for the context words. The clustering algorithms are applied to an Arabic paraphrasing benchmark that consists of 1010 pairs of Arabic sentences constructed on the basis of Arabic transformation rules and labeled for similarity and paraphrasing. Experimental results show that pre-trained embeddings, weighting context words by part of speech, and labeling sentence pairs by the majority of experts provide better recall and precision.

Keywords Paraphrasing identification · K-means clustering · Agglomerative clustering · Evaluation · Sentence similarity
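To make the three factors named in the abstract concrete (sentence representation, word weighting, and similarity measure), the following is a minimal sketch, not the authors' implementation. It assumes a sentence is represented as a weighted average of word vectors and compared with cosine similarity. The toy vectors in `EMBEDDINGS`, the weights in `POS_WEIGHTS`, and the English example sentences are all hypothetical and chosen only for illustration; a real experiment would load a pre-trained embedding model and use tagged Arabic sentence pairs.

```python
import math

# Hypothetical 3-dimensional word vectors; a real system would load a
# pre-trained embedding model instead of this toy table.
EMBEDDINGS = {
    "the": [0.1, 0.1, 0.1],
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "sleeps": [0.1, 0.9, 0.0],
    "naps": [0.2, 0.8, 0.1],
    "car": [0.0, 0.1, 0.9],
    "stops": [0.1, 0.3, 0.8],
}

# Hypothetical part-of-speech weights (assumption: content words count more).
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "DET": 0.1}
DEFAULT_WEIGHT = 0.5  # used for tags missing from POS_WEIGHTS


def sentence_vector(tagged_words):
    """Return the POS-weighted average of the word vectors.

    tagged_words is a list of (word, pos_tag) pairs; words without an
    embedding are skipped.
    """
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, pos in tagged_words:
        vec = EMBEDDINGS.get(word)
        if vec is None:
            continue
        weight = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when no word is known
    return [x / weight_sum for x in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
s2 = [("the", "DET"), ("kitten", "NOUN"), ("naps", "VERB")]
s3 = [("the", "DET"), ("car", "NOUN"), ("stops", "VERB")]

sim_paraphrase = cosine_similarity(sentence_vector(s1), sentence_vector(s2))
sim_unrelated = cosine_similarity(sentence_vector(s1), sentence_vector(s3))
print(f"paraphrase-like pair: {sim_paraphrase:.3f}")
print(f"unrelated pair:       {sim_unrelated:.3f}")
```

In this toy setting the paraphrase-like pair scores higher than the unrelated pair. Changing the embedding table, the weights, or the similarity function changes the scores, which is the kind of sensitivity the study evaluates.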

1 Introduction

Estimating the similarity between two pieces of text, whether words, sentences, or documents, is an essential part of many Natural Language Processing (NLP) applications such as text summarization, question answering, information retrieval, and document clustering (Klavans et al. 1999). Semantic similarity is a score that represents the semantic relation between two texts: the higher the score, the closer the two texts are in meaning (Alian and Awajan 2018). Deciding whether two texts have a qualitative semantic relation between them is a challenging task. A semantic relation between two texts can be a paraphrase relation or an entailment relation. In the former, the two texts share the same meaning, whereas in the latter, one text is inferred from the other (Lintean and Rus 2012).

Paraphrasing is the process of representing a sentence with different words and structure to produce a new sentence (Fernando and Stevenson 2008; Awajan and Alian 2020). Paraphrasing may be used to exhibit a good understanding of what has been read by rewriting it with new words or structure, provided that the original text is referenced; otherwise the result is considered a form of cheating or plagiarism. Paraphrase identification involves the detection of different linguistic phrases or expressions with similar meaning. Conversely, determining the degree of similarity is part of the semantic similarity task (Jaradat et al. 2017). Semantic similarity is also an essential part of paraphrase detection, as it mea

* Marwah Alian · Arafat Awajan
1 Hashemite University, Zarqa, Jordan
2 Princess Sumaya University for Technology, Amman, Jordan
