A study of the performance of general compressors on log files
Kundi Yao · Heng Li · Weiyi Shang · Ahmed E. Hassan
Published online: 12 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Large-scale software systems and cloud services continue to produce a large amount of log data. Such log data is usually preserved for a long time (e.g., for auditing purposes). General compressors, like the LZ77 compressor used in gzip, are usually used in practice to compress log data to reduce the cost of long-term storage. However, such general compressors do not consider the unique nature of log data. In this paper, we study the performance of general compressors on compressing log data relative to their performance on compressing natural language data. We used 12 widely used general compressors to compress nine log files that are collected based on surveying prior literature on text compression, log compression and log analysis. We observe that log data is more repetitive than natural language data, and that log data can be compressed and decompressed faster with higher compression ratios. Moreover, the compressor with the highest compression ratio for natural language data is rarely the one with the highest compression ratio for log data. Yet the compressors with the highest compression ratio for log data are rarely adopted in practice by current logging libraries and log management tools. We also observe that the peak compression and decompression speeds of general compressors on log data are often achieved with a small data size, while such a size may not be used by log management tools. Finally, we observe that the optimal compression performance (measured by a combined compression performance score) of log data usually requires the compression level to be configured higher than the default level. Our findings call for careful consideration when choosing general compressors and their associated compression levels for log data in practice. In addition, our findings shed light on the opportunities for future research on compressors that better suit the characteristics of log data.

Keywords Log compression · Software logging · Log management · Language model
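To make the notions of compression ratio and compression level concrete, the following is a minimal sketch, not the study's benchmark setup. It assumes the compression ratio is defined as original size divided by compressed size, uses only three compressors from the Python standard library (zlib, bz2, lzma) rather than the 12 compressors studied, and runs on synthetic log lines produced by a hypothetical helper, `make_synthetic_log`, instead of the paper's log files.

```python
import bz2
import lzma
import zlib


def make_synthetic_log(n_lines=2000):
    """Build hypothetical, highly repetitive log lines (not the paper's data)."""
    lines = [
        f"2020-08-12 10:{i // 60 % 60:02d}:{i % 60:02d} INFO worker-{i % 8} "
        f"Processed request id={i} status=200"
        for i in range(n_lines)
    ]
    return "\n".join(lines).encode("utf-8")


def compression_ratio(data, compress):
    """Assumed definition: original size divided by compressed size."""
    return len(data) / len(compress(data))


# Illustrative compressor/level choices; zlib's default level is 6.
COMPRESSORS = {
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-6": lambda d: zlib.compress(d, 6),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2-9": lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d),
}


def measure(data):
    """Return the compression ratio of each configured compressor on data."""
    return {name: compression_ratio(data, fn) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    for name, ratio in measure(make_synthetic_log()).items():
        print(f"{name}: {ratio:.2f}")
```

Running the script prints one ratio per compressor and level. The absolute numbers depend entirely on the synthetic input and say nothing about the results reported in the paper; the sketch only shows how such a comparison across compressors and levels can be set up.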
Communicated by: Paolo Tonella
Extended author information available on the last page of the article.
Empirical Software Engineering (2020) 25:3043–3085
1 Introduction
Log data is generated by logging statements that developers place into the source code for tracing, debugging and failure diagnosis (Yuan et al. 2010a; Fu et al. 2009; Jiang et al. 2008b; Xu et al. 2009; Mariani and Pastore 2008; Nagaraj et al. 2012; Syer et al. 2013). Log data is usually the only source of information that enables practitioners to understand the field runtime behavior of a system (Yuan et al. 2012; Li et al. 2017; Zhu et al. 2015; Chen and Jiang 2017). Besides, log data and its long-term archival is usually required for legal compliance (Jiang et al. 2008a). As a result, large-scale software systems usually produce a large volume of l