An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection
ORIGINAL PAPER
Andrianna Polydouri 1 · Eleni Vathi 1 · Georgios Siolas 1 · Andreas Stafylopatis 1
Received: 5 January 2018 / Accepted: 13 May 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018
Abstract
The ever-increasing volume of information, driven by the widespread use of computers and the web, has made effective plagiarism detection methods a necessity. Plagiarism appears in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, when no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that differ significantly from it. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, the class imbalance of the data as a crucial aspect of the problem and experiment with various balancing techniques. We also propose several novel stylistic features. We combine our features and our treatment of the imbalanced dataset with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and achieves the best resulting scores.

Keywords Intrinsic plagiarism detection · Stylometry · Supervised learning · Unbalanced training data · SMOTE · PAN Webis
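The main idea stated in the abstract, profiling the document's style and marking passages that differ significantly, can be illustrated with a minimal sketch. This is not the supervised classifier developed in this work: it is a simple unsupervised illustration, and the two toy features, the function names `style_features` and `flag_deviating_passages`, and the `threshold` value are all assumptions chosen for the example.

```python
import re
import statistics


def style_features(passage):
    """Return two toy stylistic features (assumed for illustration):
    mean word length and mean sentence length in words."""
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0)
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sent_len = len(words) / len(sentences)
    return (mean_word_len, mean_sent_len)


def flag_deviating_passages(passages, threshold=1.5):
    """Return sorted indices of passages whose value for any feature lies
    more than `threshold` population standard deviations from the
    document-wide mean of that feature (hypothetical decision rule)."""
    feats = [style_features(p) for p in passages]
    flagged = set()
    for dim in range(2):
        values = [f[dim] for f in feats]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)
        if std == 0:
            continue  # no variation in this feature, nothing to flag
        for i, value in enumerate(values):
            if abs(value - mean) / std > threshold:
                flagged.add(i)
    return sorted(flagged)
```

Calling `flag_deviating_passages` on a list of passages from one document returns the positions of the stylistic outliers. A supervised system would instead feed such features, computed per passage, to a classifier trained on labeled plagiarized and original passages.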
* Andrianna Polydouri [email protected]
1 Intelligent Systems, Content and Interaction Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece

1 Introduction
Plagiarism is the act of taking or closely imitating someone else’s work and presenting it as original, without proper citation or acknowledgment. Plagiarism detection in text documents is divided into two major categories: extrinsic and intrinsic methods. The difference between them is whether a reference collection of source documents is required. Extrinsic methods detect suspicious similarities between a collection of potential source documents and a set of suspicious documents, while intrinsic methods aim to identify which passages of an investigated document are plagiarized by observing the variation of the writing style within the document.

Intrinsic plagiarism detection (IPD) is based on the idea that not only does every author have a personal and unique writing style, but also that, by using stylistic and/or semantic means, this style can be detected and quantified. As a result, by analyzing a document and searching for passages that do not seem to fit the personal writing style of the author, it is also possible to detect potential plagiarism. All of this holds under one condition: the examined document