Parrot: A Progressive Analysis System on Large Text Collections
Yazhong Zhang¹,² · Hanbing Zhang¹,² · Zhenying He¹,² · Yinan Jing¹,² · Kai Zhang¹,² · X. Sean Wang¹,²,³

Received: 1 June 2020 / Revised: 20 August 2020 / Accepted: 5 October 2020
© The Author(s) 2020
Abstract

The size of textual data continues to grow along with the need for timely and cost-effective analysis, while the growth of computation power cannot keep up with the growth of data. Delays in processing huge textual data can negatively impact user activity and insight. This calls for a paradigm shift from blocking execution to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation on text. The model is based on an incremental execution engine and calculates a series of approximate results for a single query in a progressive way, providing a smooth trade-off between accuracy and latency. As part of this model, we propose a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system, called Parrot, on top of Apache Spark and used real-world data to test its performance. Experiments demonstrate that our method is 2.4×–19.7× faster at reaching a result within 1% error, while the confidence interval always covers the accurate results very well.

Keywords: Approximate query processing · Text data analytics · Term frequency · Bootstrap
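The general idea of sample-based progressive term frequency estimation can be illustrated with a minimal sketch. This is not Parrot's implementation: it uses the standard percentile bootstrap rather than the bootstrap variant proposed in the paper, it runs on a plain in-memory Python list rather than on Apache Spark, and every name in it (`progressive_term_frequency`, `batch_size`, `n_boot`) is hypothetical.

```python
import random


def progressive_term_frequency(docs, term, batch_size=200, n_boot=200, seed=0):
    """Yield (sample_size, estimate, ci_low, ci_high) after each batch.

    `estimate` is the estimated total number of occurrences of `term`
    across all documents, scaled up from the sample seen so far.
    Assumption: tokens are separated by whitespace.
    """
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)  # shuffle once so every prefix is a uniform random sample

    counts = []  # per-document occurrence counts for the documents sampled so far
    for start in range(0, len(order), batch_size):
        for i in order[start:start + batch_size]:
            counts.append(docs[i].split().count(term))

        n = len(counts)
        scale = len(docs) / n  # scale the sample total up to the full collection
        estimate = sum(counts) * scale

        # Standard percentile bootstrap (NOT the paper's variant):
        # resample the per-document counts with replacement and recompute.
        boots = sorted(
            sum(rng.choices(counts, k=n)) * scale for _ in range(n_boot)
        )
        ci_low = boots[int(0.025 * n_boot)]
        ci_high = boots[int(0.975 * n_boot) - 1]
        yield n, estimate, ci_low, ci_high


if __name__ == "__main__":
    # Synthetic collection, purely for illustration.
    demo_rng = random.Random(1)
    vocabulary = ["spark", "text", "data", "query", "sample"]
    collection = [
        " ".join(demo_rng.choice(vocabulary) for _ in range(20))
        for _ in range(2000)
    ]
    for n, est, low, high in progressive_term_frequency(collection, "spark"):
        print(f"n={n:5d}  estimate={est:9.1f}  95% CI=[{low:.1f}, {high:.1f}]")
```

A caller can stop consuming the generator as soon as the interval is narrow enough, which is the accuracy-versus-latency trade-off the abstract describes. Note that this naive version recomputes every bootstrap replicate from scratch at each step and applies no finite-population correction, so the interval does not collapse to zero width even once the whole collection has been read.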
The preliminary version of this work was published at the International Conference on Database Systems for Advanced Applications (DASFAA) 2020.

* Corresponding author: Yinan Jing

1 School of Computer Science, Fudan University, Shanghai, China
2 Shanghai Key Laboratory of Data Science, Shanghai, China
3 Shanghai Institute of Intelligent Electronics and Systems, Shanghai, China

1 Introduction

A huge and growing amount of textual data is produced on the Internet. On Twitter, for example, more than 500 million tweets were published per day in 2017.¹ These data are of great analytic value across many fields, including hot topic analysis and social public sentiment. Compared to structured data, textual data contains more semantic information, such as term frequency and tf-idf, whereas existing SQL aggregation functions focus mainly on numerical values and are thus not suitable. Moreover, because documents are not correlated with one another, people have much less prior knowledge about the distribution of words, especially on a subset. Analyzing textual data through a fixed collection of workloads therefore becomes unrealistic, and interactive exploration has become popular. An interactive exploration tool gives the user opportunities to continuously approach the final goal by iteratively executing queries using varying predicates [7]. A key requirement of these tools is the ability to provide query results at “human sp