A crowdsourcing approach to construct mono-lingual plagiarism detection corpus


Habibollah Asghari1 · Omid Fatemi1 · Salar Mohtaj2 · Heshaam Faili1

Received: 22 January 2018 / Revised: 13 July 2020 / Accepted: 15 August 2020 © Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier to commit and, on the other hand, easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages, and such corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform is developed and crowd workers are asked to paraphrase fragments of text. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.

Keywords Persian corpus · Crowdsourcing · Plagiarism detection · Text re-use detection · Low-resource languages

Omid Fatemi (corresponding author) · Habibollah Asghari · Salar Mohtaj · Heshaam Faili

1 School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran

2 ICT Research Institute, Academic Center for Education, Culture and Research (ACECR), Tehran, Iran

1 Introduction

Plagiarism refers to the use of others' words or ideas without appropriate acknowledgment [41]. In text reuse, the source fragment is properly cited in the target document, whereas plagiarism is the use of a source fragment without proper citation. In recent years, the high availability of published papers and other documents in digital libraries has made it easier to copy fragments from different sources and use them to write a new manuscript. Detecting concealed plagiarism in research materials (e.g., scientific papers, theses and other types of publications) is a pressing problem affecting many stakeholders, from researchers to academic publishers, digital libraries and research institutes [25]. In addition to the role of the Internet in increasing plagiarism among students, science and technology policies adopted in some countries in recent years have increased the number of scientific papers published in those countries. These policies include incentives for students to publish academic papers. I