An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection


ORIGINAL PAPER

Andrianna Polydouri¹ · Eleni Vathi¹ · Georgios Siolas¹ · Andreas Stafylopatis¹
Received: 5 January 2018 / Accepted: 13 May 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract

The ever-increasing volume of information due to the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism can be found in many settings and forms: in literature, in academic papers, and even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that differ significantly from it. In this work, we follow a supervised machine learning classification approach. We treat, for the first time, the imbalance of the data as a crucial parameter of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best performing detection systems on these datasets, and achieves the best resulting scores.

Keywords Intrinsic plagiarism detection · Stylometry · Supervised learning · Unbalanced training data · SMOTE · PAN Webis
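The keywords name SMOTE as one of the balancing techniques, but this excerpt does not describe how it was configured. As a rough illustration only, the general SMOTE idea (creating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours) can be sketched as follows. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for this sketch, not the authors' implementation or features.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; not the authors' code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy example: four made-up 2-D feature vectors for the rare (plagiarized) class.
plagiarized = [[0.90, 0.20], [1.00, 0.30], [0.80, 0.25], [0.95, 0.35]]
extra = smote(plagiarized, n_new=6, k=2)
balanced_minority = plagiarized + extra  # 4 real + 6 synthetic samples
```

Because each synthetic point lies on a segment between two existing minority samples, the oversampled class stays inside the region already occupied by the real minority data rather than simply duplicating it.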

1 Introduction

Plagiarism is the act of taking or closely imitating someone else’s work and presenting it as original, without proper citation or acknowledgment. Plagiarism detection in text documents is divided into two major categories: extrinsic and intrinsic methods. The difference between them is whether a reference collection of source documents is required. Extrinsic methods detect suspicious similarities between a collection of potential source documents and a set of suspicious documents, whereas intrinsic methods aim to identify which passages of an investigated document are plagiarized by observing the variation of the writing style within the document. Intrinsic plagiarism detection (IPD) is based on the idea that not only does every author have a personal and unique writing style, but this style can also be detected and quantified by stylistic and/or semantic means. As a result, by analyzing a document and searching for passages that do not seem to fit the personal writing style of the author, it is also possible to detect potential plagiarism. All of this holds under one condition: the examined document

* Andrianna Polydouri [email protected]
¹ Intelligent Systems, Content and Interaction Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
