TADOC: Text analytics directly on compression

REGULAR PAPER

Feng Zhang¹ · Jidong Zhai² · Xipeng Shen³ · Dalin Wang¹ · Zheng Chen¹ · Onur Mutlu⁴ · Wenguang Chen² · Xiaoyong Du¹

Received: 8 October 2019 / Revised: 21 July 2020 / Accepted: 2 September 2020

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

This article provides a comprehensive description of text analytics directly on compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realization. It then presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.

Keywords Text analytics · Document analytics · Compression · Sequitur

¹ Key Laboratory of Data Engineering and Knowledge Engineering (MOE), School of Information, Renmin University of China, Beijing, China

² Department of Computer Science and Technology, Tsinghua University, Beijing, China

³ Computer Science Department, North Carolina State University, Raleigh, USA

⁴ Department of Computer Science, ETH Zürich, Zürich, Switzerland

1 Introduction

Document analytics refers to data analytics tasks that derive statistics, patterns, insights, or knowledge from textual documents (e.g., system log files, emails, corpora). It is important for many applications, from web search to system diagnosis, security, and so on.

Document analytics applications are time-consuming, especially as the data they process keep growing rapidly. At the same time, they often need a large amount of space, both in storage and memory.

A common approach to mitigating the space concern is data compression. Although it often reduces storage usage severalfold, compression does not alleviate, but actually worsens, the time concern. In current document analytics frameworks, compressed documents have to be decompressed before being processed. The decompression step lengthens the end-to-end processing time.

This work investigates the feasibility of efficient data analytics on compressed data without decompressing it. Its motivation is twofold. First, it could avoid the decompression time. Second, and more importantly, it could save some processing. Space savings by compression fundamentally stem from repetitions in the data. If the analytics algorithms could leverage the repetitions that the compression algorithm already uncovers, they could avoid unnecessary repeated processing, and hence shorten the
