The VINITI RAS Automatic Text Classification System for Processing the Flow of Scientific Publications


INFORMATION SYSTEMS

V. S. Egorov*, E. S. Kozlova**, K. E. Lomotin***, O. V. Fedorets****, A. V. Filimonov*****, and A. V. Shapkin******

All-Russian Institute for Scientific and Technical Information (VINITI), Russian Academy of Sciences, Moscow, 125315 Russia

Received April 27, 2020

Abstract: This paper presents the results of developing and testing an automatic classification system for scientific texts. The system determines the topic of a text according to three classification schemes and works in both batch and dialog modes. The paper describes the structural and functional components of the system, the methods used to assess classification quality, the training methodology, the selection of the optimal classification model, and the main areas in which the automatic classifier is being introduced into the processing of the electronic document flow at VINITI RAS.

Keywords: automatic text classification, Word2Vec, machine learning, perceptron, logistic regression, natural language processing, production technology of the information center

DOI: 10.3103/S0005105520030048

INTRODUCTION

The technological basis of any information complex (mass media, Internet sites, libraries, centers of scientific and information services, etc.) is a production system that gives users convenient and efficient navigational access to information arrays. Such a system is traditionally built on the thematic classification of the objects it processes. For VINITI, these objects are scientific publications, i.e., textual information.

Let us define the terminology used here. By text classification we mean indexing a text by a rubricator. The term "rubricator" is a synonym for "classification scheme"; accordingly, the term "heading" is a synonym for "class." Indexing by a rubricator (rubrication) is the procedure of assigning to an object the indexes of one or several headings taken from a predetermined list of thematic headings. Typical examples are the indexing of literature at libraries and of documents at information centers. The best-known classifiers are the Universal Decimal Classification, Dewey Decimal Classification, Library and Bibliographic Classification, State Rubricator of Scientific and Technical Information, and International Patent Classification.

Assigning a classification index to a document is a very expensive operation because it requires qualified specialists in the subject area. Advances in modern computer technologies for data mining make it possible to create automatic classification programs that are "trained" to analyze texts in order to assign them to some class of a particular rubricator, with varying degrees of probabi
