Randomized projective methods for the construction of binary sparse vector representations

NEW MEANS OF CYBERNETICS, INFORMATICS, COMPUTER ENGINEERING, AND SYSTEMS ANALYSIS

D. A. Rachkovskij, I. S. Misuno, and S. V. Slipchenko

UDC 004.22+004.93’11

Abstract. Properties of randomized binary vector representations with adjustable sparseness are investigated. Such representations are formed from input vectors by projecting them with a random matrix whose elements are ternary, taking values in {-1, 0, +1}. The accuracy of estimating measures of similarity and difference between the initial vectors, composed of floating-point numbers, from the output binary vectors is analyzed. The resulting vector representations can be used to efficiently process large arrays of multidimensional input vectors in applications related to search, classification, associative memory, etc.

Keywords: random projection, vector representation, sparse binary representation, distributed representation, efficient similarity estimation.

INTRODUCTION. RANDOM PROJECTIONS OF VECTOR INFORMATION

Many types of information can be represented in the form of matrices or tables. For example, for information retrieval or classification, document collections are often considered as document-word matrices [1, 2] (rows are documents, columns are words, and matrix elements are values of some function of the frequency of occurrence of words in documents). The same information can be considered as an array of vectors, or points in a multidimensional space. The dimension of the space is the number of coordinates (in this example, the number of words), and each point corresponds to an object (a document). The coordinates of such a space are vector components, or features.

The dimension of the space can reach hundreds of thousands or millions (for example, the number of words in a dictionary), and the number of points can reach millions or billions (the number of documents, for example, Internet web pages). This makes it expedient to transform high-dimensional datasets so that the objects are preserved while the efficiency of their processing and/or storage increases. Such a transformation can be realized by passing to a space with suitable properties, namely, a dimension smaller than that of the initial space and computationally more efficient data processing operations, or to a space for which efficient data processing methods exist.

Many methods and algorithms of information retrieval, classification, clustering, approximation, case-based learning and reasoning, associative memory, etc. use measures of difference and similarity of vectors, such as distance, scalar product, cosine of the angle, and ratios of distances and angles. In particular, widely used linear models are based on the scalar product, and the nearest-neighbor method uses distances. To increase efficiency, it is useful for such methods and algorithms to operate on transformed vector representations while obtaining results consistent with those in the initial multidimensional vector space. The development and
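To make the transformation described in the abstract concrete, the following Python sketch projects a high-dimensional floating-point vector with a ternary {-1, 0, +1} random matrix and then binarizes the result so that the number of ones is adjustable. This is a minimal illustrative variant, not the authors' exact construction: the matrix density `density`, the top-k binarization rule, and all function names are assumptions made for the example.

```python
import numpy as np

def ternary_random_matrix(d_out, d_in, density=0.1, rng=None):
    """Random matrix with elements in {-1, 0, +1}.
    Each element is nonzero with probability `density` and then
    takes the sign +1 or -1 with equal probability.
    (The density value is an illustrative choice, not from the paper.)"""
    rng = np.random.default_rng(rng)
    signs = rng.choice([-1, 1], size=(d_out, d_in))
    mask = rng.random((d_out, d_in)) < density
    return signs * mask

def binarize_top_k(y, k):
    """Binary code with exactly k ones: the k largest projection
    components are set to 1. Adjusting k controls the sparseness
    of the output representation (one plausible binarization rule)."""
    z = np.zeros(y.shape, dtype=np.uint8)
    z[np.argsort(y)[-k:]] = 1
    return z

# Example: project float vectors and compare similarities.
rng = np.random.default_rng(0)
d_in, d_out, k = 10_000, 1_000, 50
R = ternary_random_matrix(d_out, d_in, density=0.1, rng=rng)

x1 = rng.standard_normal(d_in)
x2 = x1 + 0.3 * rng.standard_normal(d_in)   # vector similar to x1
x3 = rng.standard_normal(d_in)              # unrelated vector

b1, b2, b3 = (binarize_top_k(R @ x, k) for x in (x1, x2, x3))
print("overlap(x1, x2):", int((b1 & b2).sum()))  # large: inputs are similar
print("overlap(x1, x3):", int((b1 & b3).sum()))  # near chance level k*k/d_out
```

Because the top-k rule yields exactly k ones per code, the overlap (number of shared ones) of two codes is directly comparable across pairs and serves as a cheap, binary-arithmetic estimate of the similarity of the original floating-point vectors; other binarization rules, such as a global threshold, are equally consistent with the scheme described in the abstract.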