Matrix Models of Texts: Models of Texts and Content Similarity of Text Documents



M. G. Kreines* and E. M. Kreines**
BaseTech Llc, Moscow, 129366 Russia
*e-mail: [email protected]
**e-mail: [email protected]

Received May 16, 2019; revised May 16, 2019; accepted July 1, 2019

Abstract: The article presents a matrix model of natural language texts and a model for the quantitative assessment of the similarity in the content of text documents. The use of the model to identify text documents with similar contents is considered. The differences between the proposed models (the matrix model and the content similarity model built on it) and the commonly used approaches to analyzing and modeling natural language texts are analyzed.

Keywords: texts in natural languages, content similarity, content similarity assessment, models of text documents, information retrieval and analysis

DOI: 10.1134/S2070048220050105

INTRODUCTION
This work reflects the authors’ belief that the study of the properties of natural languages as systems for presenting, transmitting, and discussing information and/or knowledge, together with its practical applications, cannot be limited to syntactic units (phrases and sentences). Meaningful and useful studies of semantics and word usage in the immediate phraseological context have been carried out, contributing to the creation of national corpora of different languages (see, for example, [www.ruscorpora.ru, www.natcorp.ox.ac.uk]). Another approach to conceptually similar problems involves constructing models of words that infer their semantics from the probabilistic or combinatorial analysis of their lexical distributions (for example, the distributive-statistical analysis model [1] or word2vec-type models [2, 3]).
Analyzing the text as a whole, above the level of the phraseological context, can provide a lot of valuable information about some (not necessarily all) words in that text. The transition from analyzing a specific text to analyzing collections of texts further expands the possibilities for studying and applying knowledge about the meaning and usage of words of a natural language. Within this approach, the popular phrase “This is only a word. And words mean different things for different people” becomes a research topic that goes far beyond the semantic fields constructed by the distributive-statistical analysis of text documents [1]. Such research has measurable practical significance, because its results are critical for organizing rational search and for forming objective assessments of the information and knowledge recorded in natural languages as unstructured texts.
The general concept of the presented mathematical models of text documents (hereinafter, texts) and collections of text documents (hereinafter, collections of texts) targeted at information retrieval and analysis has been previously described in [4]. A review of approaches used to model texts and t