Self-training classifier of natural-language texts



SELF-TRAINING CLASSIFIER OF NATURAL-LANGUAGE TEXTS

UDC 519.683.004.424

E. S. Borisov

A natural-language text classifier is developed using an artificial neural network. A model of the classifier and its implementation are proposed. The classification system consists of two main components, namely, a frequency analyzer and a neural network classifier. Before using the classifier, the user should first prepare a set of training texts and then train the classifier.

Keywords: neural network, classifier, natural language.

INTRODUCTION

The rapid growth of the Internet has naturally created problems of sorting and searching for information. At present, terabytes of information are held in electronic storage around the world. The number of information sources keeps increasing, and extracting the necessary knowledge from this body of information is increasingly difficult. A person cannot solve this problem efficiently without assistance, and existing search systems do not always cope with it. For these reasons, the construction of intelligent information agents is becoming increasingly urgent. After receiving a user request, such a system must browse the network, classify and analyze information, and, based on the collected knowledge, generate a concise, easily understood, and maximally complete answer to the user’s question. In this article, the classification of natural-language texts is considered as the first step toward the construction of intelligent information agents.

1. GENERAL SCHEME OF A CLASSIFICATION SYSTEM

The proposed classification system consists of two main components, shown in Fig. 1: a frequency analyzer with a system dictionary and a neural network classifier. A text arrives at the input of the system, and the subject (theme) to which the text is devoted (sport, politics, medicine, etc.) is obtained at its output.
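The data flow described above (text in, subject out) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the function names, the regular-expression tokenizer, the use of relative word frequencies, and the toy training texts are all hypothetical. In particular, a nearest-centroid rule stands in for the neural network classifier, whose details are not given at this point in the text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(training_texts):
    """Form the system dictionary: the sorted distinct words of the training texts."""
    return sorted({word for text in training_texts for word in tokenize(text)})

def frequency_vector(text, dictionary):
    """For each dictionary word, return its share of the tokens in `text`.
    Relative frequency is an assumption; raw counts would also fit the description."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in dictionary]

def train_centroids(labeled_texts, dictionary):
    """Average the frequency vectors per class.
    Stand-in for training the neural network classifier (NOT the article's method)."""
    sums, sizes = {}, Counter()
    for text, label in labeled_texts:
        vec = frequency_vector(text, dictionary)
        acc = sums.setdefault(label, [0.0] * len(dictionary))
        for i, value in enumerate(vec):
            acc[i] += value
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def classify(text, dictionary, centroids):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    vec = frequency_vector(text, dictionary)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical training set with two classes.
training = [
    ("the team won the match", "sport"),
    ("the striker scored a goal", "sport"),
    ("parliament passed the budget", "politics"),
    ("the minister won the election", "politics"),
]
dictionary = build_dictionary(text for text, _ in training)
centroids = train_centroids(training, dictionary)

print(classify("the striker scored in the match", dictionary, centroids))
```

Words that are absent from the dictionary (here, "in") still count toward the length of the text but contribute no component to the vector; this is one possible design choice among several.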
Before using the classifier, the user should determine the classes to be processed by the system and select a set of training texts. Words are then extracted from the set of training texts to form the system dictionary. At the last stage of system initialization, the neural network classifier is trained using the training texts and the resulting system dictionary. Once training is complete, the text classifier is ready for operation.

1.1. Frequency Analyzer and System Dictionary. The first component of the system consists of a frequency analyzer together with the system dictionary; it computes the so-called frequency characteristic of an input text. The frequency analyzer implements a well-known linguistic method of processing natural-language texts, namely frequency analysis, which finds the distribution of word repetitions in a text. For each word u_i from the system dictionary V, this component determines its frequency of occurrence f_i in an input text t (Fig. 2). A frequency characteristic is a vector f = (f_1, ..., f_n