The KAS corpus of Slovenian academic writing


Tomaž Erjavec¹ • Darja Fišer¹,² • Nikola Ljubešić¹,³



Accepted: 11 September 2020 © Springer Nature B.V. 2020

Abstract The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1,600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, metadata, annotation, and distribution of the corpus, which is freely accessible through on-line concordancers and openly available for research through the CLARIN.SI research infrastructure. We also present the tools for mono- and bilingual term extraction and for thesis structure annotation that were developed in the scope of the project, including the manually annotated datasets used to train these tools. This specialised corpus, large by any standards, represents a substantial and highly useful language resource for the study of Slovenian academic writing and for terminology extraction.

The work presented in this paper was supported by the basic research project J6-7094: “Slovenian scientific texts: resources and description” and by the research programme P2-0103 (B) “Knowledge Technologies”, financed by the Slovenian Research Agency.

Tomaž Erjavec [email protected]
Darja Fišer [email protected]
Nikola Ljubešić [email protected]

1 Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
2 Department of Translation, Faculty of Arts, University of Ljubljana, Aškerčeva cesta 2, Ljubljana 1000, Slovenia
3 Faculty of Computer Science and Informatics, University of Ljubljana, Večna pot 113, Ljubljana 1000, Slovenia


Keywords Academic writing • Terminology • Slovenian • Corpus • TEI

1 Introduction

The development and use of Slovenian academic language at universities and in research is one of the central questions of Slovenian language policy. The problem is highlighted in the National Program for Language Policy of the Republic of Slovenia 2014–2018,¹ and a number of European studies also draw attention to the impact that the knowledge and development of academic discourse has on language vitality. This is made very explicit in the two action plans resulting from the Resolution: the Action Plan on Language Education and the Action Plan on Language Resources. In the first document, two out of four goals are related to Slovenian in higher education and science, i.e. “supporting communication competence in /Slovenian/ scientific language” and “improving the state of Slovenian as a language of science”. In the second action plan, 8 out of 47 goals are related to Slovenian as a language of science, among them the development of a terminological portal and applications for building new terminological databases, the improvement of terminology extraction tools, the automatisation of LSP corpus building, the digit