The VINITI RAS Automatic Text Classification System for Processing the Flow of Scientific Publications


INFORMATION SYSTEMS

V. S. Egorov^a,*, E. S. Kozlova^a,**, K. E. Lomotin^a,***, O. V. Fedorets^a,****, A. V. Filimonov^a,*****, and A. V. Shapkin^a,******

^a All-Russian Institute for Scientific and Technical Information (VINITI), Russian Academy of Sciences, Moscow, 125315 Russia
*e-mail: [email protected]
**e-mail: [email protected]
***e-mail: [email protected]
****e-mail: [email protected]
*****e-mail: [email protected]
******e-mail: [email protected]

Received April 27, 2020

Abstract: This paper presents the results of the development and testing of an automatic classification system for scientific texts. The system determines the topic of a text according to three classification schemes, in both batch and dialog modes. The paper describes the structural and functional components of the system, the methods used to assess classification quality, the training methodology, the selection of the optimal classification model, and the main areas in which the automatic classifier is being introduced into the processing of the electronic document flow at VINITI RAS.

Keywords: automatic text classification, Word2Vec, machine learning, perceptron, logistic regression, natural language processing, production technology of the information center

DOI: 10.3103/S0005105520030048

INTRODUCTION

The technological basis of any information complex (mass media, Internet sites, libraries, centers of scientific and information services, etc.) is a production system that gives users convenient and efficient navigational access to information arrays. Such a system is traditionally built on the thematic classification of the objects it processes. For VINITI, these objects are scientific publications, i.e., textual information.

Let us define the terminology used below. We understand the classification of a text as its indexing by a rubricator. Here, the term “rubricator” is a synonym for “classification scheme”; accordingly, “heading” is a synonym for “class.” Indexing by a rubricator (rubrication) is the procedure of assigning to an object the indexes of one or several headings taken from a predetermined list of thematic headings. Typical examples are the indexing of literature at libraries and of documents at information centers. The best-known classification schemes are the Universal Decimal Classification, Dewey Decimal Classification, Library and Bibliographic Classification, State Rubricator of Scientific and Technical Information, and International Patent Classification.

Assigning a classification index to a document is a very expensive operation because it requires qualified specialists in the subject area. Advances in data mining make it possible to create automatic classification programs that are “trained” to analyze texts in order to assign them to some class of a particular rubricator, with varying degrees of probabi