Run-Length-Encoded Compression Scheme

Mining large datasets in a compressed domain is an interesting direction in data mining. In this chapter, we propose a nonlossy compression scheme. It is based on run-length encoding of binary-valued features, or of floating-point-valued features that are appropriately quantized into binary-valued features.




3.1 Introduction

Data Mining deals with a large number of patterns of high dimension. While dealing with such data, a number of factors become important, such as the size of the data, the dimensionality of each pattern, the number of scans of the database, storage of the entire data, storage of a derived summary of the information, and the computations on the entire data that lead to that summary. In the current chapter, we propose compression algorithms that work on patterns with binary-valued features. The algorithms are also applicable to floating-point-valued features once they are appropriately quantized into a binary-valued feature set.

Conventional methods of data reduction include clustering, sampling, and the use of sufficient statistics or other information derived from the data. For clustering and classification of such large data, the computational effort and storage space required would be prohibitively large. In structural representation of patterns, string matching is carried out using an edit distance and the longest common subsequence. One possibility for dealing with such patterns is to represent them as runs; efficient algorithms exist to compute approximate and exact edit distances of run-length-encoded strings. In the current chapter, we focus on a numerical similarity measure.

We propose a novel idea of compressing the binary data and carrying out clustering and classification directly on such compressed data. We use run-length encoding and demonstrate that such compression reduces both storage space and computation time. The work is directly applicable to mining of large-scale business transactions. The major contribution of the idea is a scheme that achieves several goals at once: reduced storage space, computation of the distance function on run-length-encoded binary patterns without having to decompress them, preservation of the classification accuracy obtained with the original uncompressed data, and significantly reduced processing time. We discuss the theoretical foundations of the idea as well as practical implementation results.
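As an illustration of the scheme described above, the following sketch encodes a binary pattern as runs and computes the Hamming distance between two patterns by merging their run lists, so that the distance is obtained without decompressing either pattern. It is a minimal sketch in Python written for exposition; the names rle_encode and rle_hamming, the (first bit, run lengths) representation, and the choice of Hamming distance are assumptions made here and not taken from the book.

  from typing import List, Tuple

  # An RLE code is stored as (value of the first run, lengths of successive runs).
  RLE = Tuple[int, List[int]]

  def rle_encode(bits: List[int]) -> RLE:
      # Compress a binary pattern into its run-length representation.
      runs = [1]
      for prev, cur in zip(bits, bits[1:]):
          if cur == prev:
              runs[-1] += 1       # extend the current run
          else:
              runs.append(1)      # a new run starts here
      return bits[0], runs

  def rle_hamming(a: RLE, b: RLE) -> int:
      # Hamming distance computed directly on the runs, without decompression.
      # The two run lists are merged with two pointers; whenever the current
      # run values disagree, the overlapping chunk contributes its length.
      val_a, runs_a = a
      val_b, runs_b = b
      i = j = 0
      rem_a, rem_b = runs_a[0], runs_b[0]
      dist = 0
      while i < len(runs_a) and j < len(runs_b):
          chunk = min(rem_a, rem_b)      # overlap of the two current runs
          if val_a != val_b:
              dist += chunk              # every bit in the overlap differs
          rem_a -= chunk
          rem_b -= chunk
          if rem_a == 0:                 # advance to the next run of pattern a
              i += 1
              val_a ^= 1
              rem_a = runs_a[i] if i < len(runs_a) else 0
          if rem_b == 0:                 # advance to the next run of pattern b
              j += 1
              val_b ^= 1
              rem_b = runs_b[j] if j < len(runs_b) else 0
      return dist

  if __name__ == "__main__":
      x = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
      y = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
      cx, cy = rle_encode(x), rle_encode(y)
      # The compressed-domain distance matches the one on the raw bit vectors.
      assert rle_hamming(cx, cy) == sum(xi != yi for xi, yi in zip(x, y))
      print(cx, cy, rle_hamming(cx, cy))

For binary-valued features the Hamming distance coincides with the Manhattan and squared Euclidean distances, so a merge of this kind serves the dissimilarity computations on which clustering and classification rely, which is the property the chapter exploits.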



3.2 Compression Domain for Large Datasets

Data Mining deals with large datasets. The datasets can be formal databases, data warehouses, flat files, etc. From the Pattern Recognition perspective, a "Large Dataset" can be defined as a set of patterns that is not amenable to in-memory storage and operations.

Large Dataset: Let n and d represent the number of patterns and the number of features, respectively. Largeness of data can be defined as one of the following:
• n is small, and d is large
• n is large, and d is small
• n is large, and d is large.

Algorithms needing multiple data scans result in high processing requirements. This motivates one to look for conventional and hybrid algorithms t
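To give a feel for how compression bears on this notion of largeness, the short sketch below makes a back-of-the-envelope comparison between the storage needed for an n x d binary pattern matrix and that of its run-length-encoded form. The dataset size, the sparsity of the patterns, and the one-byte-per-run-length encoding are illustrative assumptions, not figures from the book.

  import random

  def count_runs(bits):
      # Number of runs in a binary pattern, i.e., the length of its RLE run list.
      return 1 + sum(1 for prev, cur in zip(bits, bits[1:]) if cur != prev)

  # Hypothetical dataset: n patterns, d binary features, sparse transaction-like rows.
  n, d = 1000, 192
  random.seed(0)
  patterns = [[1 if random.random() < 0.02 else 0 for _ in range(d)] for _ in range(n)]

  raw_bits = n * d                                   # one bit per feature, bit-packed
  total_runs = sum(count_runs(p) for p in patterns)
  rle_bits = total_runs * 8                          # each run length fits in one byte since d < 256

  print(f"raw: {raw_bits} bits, RLE: {rle_bits} bits, ratio: {rle_bits / raw_bits:.2f}")

Whether the run-length form actually wins depends on how long the runs are; for the large-scale business transactions mentioned above, long runs of identical values are what make the scheme pay off in both storage and processing time.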