Factors affecting sentence similarity and paraphrasing identification


Marwah Alian¹,² · Arafat Awajan²
Received: 24 February 2020 / Accepted: 3 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Sentence similarity determines whether two sentences are close in structure and meaning. The detection of sentence similarity can be affected by several factors, such as sentence representation, similarity measure, and word weighting function. In this study, the impact of three factors that influence similarity detection and paraphrasing identification is evaluated using clustering algorithms. To evaluate the impact of these factors, we tried different word embedding models, clustering algorithms, and weighting methods for the context words. The clustering algorithms are applied to an Arabic paraphrasing benchmark that consists of 1010 pairs of Arabic sentences constructed on the basis of Arabic transformation rules and labeled for similarity and paraphrasing. Experimental results show that pre-trained embedding, weighting context words by part of speech, and labeling sentence pairs by the majority of experts provide better recall and precision.

Keywords  Paraphrasing identification · K-means clustering · Agglomerative clustering · Evaluation · Sentence similarity
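As a rough illustration of how the factors named in the abstract fit together (sentence representation, word weighting, similarity measure, and clustering), the following minimal sketch may help. It is not the authors' implementation: the toy vectors in `TOY_VECTORS`, the part-of-speech weights in `POS_WEIGHTS`, the English example sentences, and the one-dimensional two-cluster k-means in `two_means` are all assumptions made for illustration only. A real experiment would load a pre-trained embedding model and a standard clustering library.

```python
import math

# Hypothetical 3-dimensional toy vectors; a real system would load a
# pre-trained word embedding model instead.
TOY_VECTORS = {
    "the": [0.3, 0.3, 0.3],
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "sleeps": [0.0, 1.0, 0.0],
    "naps": [0.1, 0.9, 0.0],
    "car": [0.0, 0.0, 1.0],
    "stops": [0.0, 0.2, 0.9],
}

# Hypothetical part-of-speech weights (illustrative, not values from the paper).
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "OTHER": 0.3}


def sentence_vector(tagged_tokens):
    """Weighted average of word vectors for a list of (word, pos) tuples."""
    dim = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, pos in tagged_tokens:
        vec = TOY_VECTORS.get(word)
        if vec is None:
            continue  # skip out-of-vocabulary words
        weight = POS_WEIGHTS.get(pos, POS_WEIGHTS["OTHER"])
        total = [t + weight * v for t, v in zip(total, vec)]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_means(scores, iterations=20):
    """1-D k-means with k=2; label 1 marks the higher-similarity cluster."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iterations):
        labels = [1 if abs(s - high) < abs(s - low) else 0 for s in scores]
        low_members = [s for s, l in zip(scores, labels) if l == 0]
        high_members = [s for s, l in zip(scores, labels) if l == 1]
        if low_members:
            low = sum(low_members) / len(low_members)
        if high_members:
            high = sum(high_members) / len(high_members)
    return labels


cat_sleeps = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
kitten_naps = [("the", "DET"), ("kitten", "NOUN"), ("naps", "VERB")]
cat_naps = [("the", "DET"), ("cat", "NOUN"), ("naps", "VERB")]
kitten_sleeps = [("the", "DET"), ("kitten", "NOUN"), ("sleeps", "VERB")]
car_stops = [("the", "DET"), ("car", "NOUN"), ("stops", "VERB")]

PAIRS = [
    (cat_sleeps, kitten_naps),
    (cat_sleeps, car_stops),
    (cat_naps, kitten_sleeps),
    (kitten_naps, car_stops),
]

# One similarity score per sentence pair, then split the pairs into two clusters.
scores = [cosine(sentence_vector(a), sentence_vector(b)) for a, b in PAIRS]
labels = two_means(scores)
```

Changing the entries of `POS_WEIGHTS`, or swapping `TOY_VECTORS` for a different embedding source, alters the pair scores and therefore the cluster assignment. That is the kind of sensitivity the study sets out to measure, here reduced to a toy scale.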

* Marwah Alian [email protected]
Arafat Awajan [email protected]
¹ Hashemite University, Zarqa, Jordan
² Princess Sumaya University for Technology, Amman, Jordan

1 Introduction

The estimation of similarity between two pieces of text, whether words, sentences, or documents, is an essential part of many Natural Language Processing (NLP) applications such as text summarization, question answering, information retrieval, and document clustering (Klavans et al. 1999). Semantic similarity is a score that represents the semantic relation between two texts: the higher the score, the more similar the meaning of the two texts (Alian and Awajan 2018). Deciding whether two texts have a qualitative semantic relation between them is a challenging task. A semantic relation between two texts could be a paraphrase relation or an entailment relation. In the former, the two texts share the same meaning, whereas in the latter, one text is inferred from the other (Lintean and Rus 2012).

Paraphrasing is the process of representing a sentence with different words and structure to produce a new sentence (Fernando and Stevenson 2008; Awajan and Alian 2020). Paraphrasing may be used to exhibit a good understanding of what has been read by rewriting it with new words or structure, provided that the original text is referenced; otherwise, the result will be considered a type of cheating or plagiarism. Paraphrase identification involves the detection of different linguistic phrases or expressions with similar meaning. Conversely, determining the degree of similarity is part of the semantic similarity task (Jaradat et al. 2017). Semantic similarity is also an essential part of paraphrase detection, as it mea