Anonymization of System Logs for Preserving Privacy and Reducing Storage
1 Technical University of Dresden, Dresden, Germany
2 University of Basel, Basel, Switzerland
Abstract. System logs constitute valuable information for the analysis and diagnosis of system behavior. The analysis is highly time-consuming for large log volumes. For many parallel computing centers, outsourcing the analysis of system logs (syslogs) to third parties is the only option. Therefore, a general analysis and diagnosis solution is needed. Such a solution is possible only through the analysis of syslogs from multiple computing systems. The data within syslogs can be sensitive, thus obstructing the sharing of syslogs across institutions, with third-party entities, or in the public domain. This work proposes a new method for the anonymization of syslogs that employs de-identification and encoding to provide fully shareable system logs. In addition to eliminating the sensitive data within the test logs, the proposed anonymization method provides a 25% performance improvement in post-processing of the anonymized syslogs and a reduction of more than 80% in their required storage space.
Keywords: Privacy · Anonymization · Encoding · System logs · Data quality · Size reduction · Performance improvement
1 Introduction
System logs are valuable sources of information for the analysis and diagnosis of system behavior. The size of computing systems and the number of their components continually increase, and the volume of generated system logs (hereafter, syslogs) grows in proportion. Storing the syslogs produced by large parallel computing systems for later analysis requires high storage capacity. Moreover, the existence of sensitive data within the syslogs raises serious concerns about their storage, analysis, dissemination, and publication. The anonymization of syslogs is a means to address the second challenge. During anonymization, the sensitive information is eliminated, and the remaining data is considered cleansed data. To the best of our knowledge, no existing automatic anonymization method guarantees full user privacy, because there is always a small probability that sensitive data leaks into the cleansed data.
© Springer Nature Switzerland AG 2019. K. Arai et al. (Eds.): FICC 2018, AISC 887, pp. 162–179, 2019. https://doi.org/10.1007/978-3-030-03405-4_11
Applying anonymization methods to syslogs to cleanse the sensitive data before storage, analysis, sharing, or publication reduces the usability of the anonymized syslogs for further analysis. After a certain degree of anonymization, the cleansed syslog entries lose their significance and remain useful only for statistical analysis, such as time series and distributions. At this stage, it is possible to encode long syslog entries into shorter strings. Encoding significantly reduces the required storage capacity of syslogs and addresses the storage challenge mentioned earlier. Sh
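As a rough illustration of the two steps described above (cleansing sensitive fields, then encoding long entries into shorter strings), the following Python sketch may help. It is not the authors' method, whose details lie outside this excerpt. The regular expressions, the placeholder tokens, and the names `deidentify` and `Encoder` are all assumptions made for illustration, and the dictionary-based code assignment is a stand-in for whatever encoding the paper actually uses.

```python
import re

# Hypothetical de-identification rules: (pattern, placeholder).
# A real deployment would need a site-specific, audited list.
RULES = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),  # IPv4 addresses
    (re.compile(r"\buser \S+"), "user <USER>"),            # user names
    (re.compile(r"\b\d+\b"), "<NUM>"),                     # ports, PIDs, ...
]


def deidentify(message):
    """Replace potentially sensitive fields with fixed placeholders."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message


class Encoder:
    """Map each distinct cleansed entry to a short hexadecimal code.

    Assumed scheme: codes are assigned in order of first appearance.
    """

    def __init__(self):
        self.codes = {}

    def encode(self, cleansed):
        if cleansed not in self.codes:
            self.codes[cleansed] = format(len(self.codes), "x")
        return self.codes[cleansed]


if __name__ == "__main__":
    # Made-up example entries, not taken from any real system.
    entries = [
        "Accepted password for user alice from 192.168.1.10 port 5022",
        "Accepted password for user bob from 10.0.0.7 port 6011",
        "Connection closed by 10.0.0.7",
    ]
    encoder = Encoder()
    for entry in entries:
        cleansed = deidentify(entry)
        print(encoder.encode(cleansed), cleansed)
```

Once the variable fields are replaced, the first two entries become identical and share one code. This is why a cleansed log supports only statistical analysis, and also why it can be stored so compactly: each entry shrinks to a timestamp plus a short code, at the cost of keeping the code table if the cleansed text must remain readable.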