Knowledge Graphs as Context Models: Improving the Detection of Cross-Language Plagiarism with Paraphrasing
1 Natural Language Engineering Lab - ELiRF, DSIC, Universitat Politècnica de València, Valencia, Spain {mfranco,pgupta,prosso}@dsic.upv.es
2 Linguistic Computing Laboratory (LCL), Sapienza Università di Roma, Roma, Italy [email protected]
Abstract. Cross-language plagiarism detection attempts to automatically identify and extract plagiarism among documents in different languages. Plagiarized fragments can be translated verbatim copies, or their structure may be altered to hide the copying; the latter is known as paraphrasing and is more difficult to detect. To improve paraphrasing detection, we use a knowledge graph-based approach to obtain and compare context models of document fragments in different languages. Experimental results in German-English and Spanish-English cross-language plagiarism detection indicate that our knowledge graph-based approach performs better than other state-of-the-art models.
Keywords: Cross-language plagiarism detection, textual similarity, paraphrasing, knowledge graphs, BabelNet.
1 Introduction
One of the biggest problems in literature and science is plagiarism: the unauthorized use of original content. Plagiarism is very difficult to detect, especially when the source of information is the web, because of its size. Detection is even more difficult when the documents involved are written in different languages. A recent survey of scholarly practices and attitudes [2], which also covered the cross-language (CL) perspective, shows that CL plagiarism is a real problem: only 36.25% of students think that translating a text fragment and including it in their report is plagiarism. Plagiarized fragments can be translated verbatim copies, or their authors can hide them by altering their structure, which is known as paraphrasing. A recent study on paraphrasing in plagiarism [1] has shown that paraphrase mechanisms make
The research has been carried out in the framework of the European Commission WIQ-EI IRSES (no. 269180) and DIANA-APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) projects as well as the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. We thank Roberto Navigli for offering help to get familiar with the BabelNet API.
N. Ferro (Ed.): PROMISE Winter School 2013, LNCS 8173, pp. 227–236, 2014. © Springer-Verlag Berlin Heidelberg 2014
M. Franco-Salvador, P. Gupta, and P. Rosso
plagiarism detection more difficult. Moreover, this study also shows that lexical substitutions are the paraphrase mechanisms most used in plagiarism, and that they shorten the plagiarized text. These findings may be used in the future to develop more effective plagiarism detectors. In recent years there have been a few approaches to CL similarity analysis that can be used for CL plagiarism detection. A simple yet effective approach is the cross-language character n-gram (CL-CNG) model [9], which is based on the syntax of documents, uses character n-grams, and offers remarkable perform
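As a rough illustration of the character n-gram idea behind CL-CNG, the sketch below compares two text fragments by the cosine similarity of their character n-gram counts. It is a minimal sketch under stated assumptions, not the model of [9]: the n-gram length (3), the normalisation step, the use of cosine similarity, and the function names `char_ngrams`, `cosine` and `cl_cng_similarity` are all choices made here for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count the character n-grams of a fragment (n = 3 is an assumption)."""
    # Assumed normalisation: lowercase, then replace every run of
    # non-word characters (punctuation, whitespace) with a single space.
    normalised = re.sub(r"[^\w]+", " ", text.lower()).strip()
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or too-short fragment has no n-grams
    return dot / (norm_a * norm_b)


def cl_cng_similarity(x: str, y: str, n: int = 3) -> float:
    """Hypothetical helper: n-gram similarity of two fragments."""
    return cosine(char_ngrams(x, n), char_ngrams(y, n))


if __name__ == "__main__":
    spanish = "la detección de plagio"
    english = "plagiarism detection"
    print(cl_cng_similarity(spanish, english))
```

Because the comparison relies on shared character sequences, a measure of this kind can only score well for language pairs with lexical overlap, such as cognates like "detección" and "detection".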