TADOC: Text analytics directly on compression


REGULAR PAPER

Feng Zhang¹ · Jidong Zhai² · Xipeng Shen³ · Dalin Wang¹ · Zheng Chen¹ · Onur Mutlu⁴ · Wenguang Chen² · Xiaoyong Du¹

Received: 8 October 2019 / Revised: 21 July 2020 / Accepted: 2 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract This article provides a comprehensive description of text analytics directly on compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to realizing it effectively. It also presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.

Keywords Text analytics · Document analytics · Compression · Sequitur

✉ Feng Zhang [email protected]
Jidong Zhai [email protected]
Xipeng Shen [email protected]
Dalin Wang [email protected]
Zheng Chen [email protected]
Onur Mutlu [email protected]
Wenguang Chen [email protected]
Xiaoyong Du [email protected]

¹ Key Laboratory of Data Engineering and Knowledge Engineering (MOE), School of Information, Renmin University of China, Beijing, China
² Department of Computer Science and Technology, Tsinghua University, Beijing, China
³ Computer Science Department, North Carolina State University, Raleigh, USA
⁴ Department of Computer Science, ETH Zürich, Zürich, Switzerland

1 Introduction

Document analytics refers to data analytics tasks that derive statistics, patterns, insights, or knowledge from textual documents (e.g., system log files, emails, corpora). It is important for many applications, from web search to system diagnosis, security, and so on. Document analytics applications are time-consuming, especially as the data they process keep growing rapidly. At the same time, they often need a large amount of space, both in storage and memory.

A common approach to mitigating the space concern is data compression. Although it often reduces storage usage severalfold, compression does not alleviate the time concern but actually worsens it. In current document analytics frameworks, compressed documents have to be decompressed before being processed. The decompression step lengthens the end-to-end processing time.

This work investigates the feasibility of efficient data analytics on compressed data without decompressing it. Its motivation is twofold. First, it could avoid the decompression time. Second, and more importantly, it could save some processing. Space savings by compression fundamentally stem from repetitions in the data. If the analytics algorithms could leverage the repetitions that the compression algorithm has already uncovered, they could avoid unnecessary repeated processing, and hence shorten the