A Proposal: High-Throughput Robust Architecture for Log Analysis and Data Stream Mining




Abstract Various data mining approaches are now available that handle large static data sets despite limited computational resources. However, these approaches fall short when mining high-speed, endless streams: their learning procedures, though simple, require the entire training process to be repeated for each newly arriving instance. Continuous data streams pose several challenges: they are many times larger than the available memory, they arrive in real time, each new instance should be inspected at most once, and predictions must still be made. Another issue with continuous real-time data is that the underlying concepts change over time, which is often called concept drift. This paper addresses these problems by proposing a real-time, scalable, and robust architecture. It is a general-purpose architecture based on online machine learning that efficiently logs and mines stream data in a fault-tolerant manner. It consists of two frameworks: (1) an event aggregation framework, which reliably collects events and messages from multiple sources and ships them to a destination for processing, and (2) a real-time computation framework, which processes streams online to extract information patterns. It guarantees reliable processing of billions of messages per second. Furthermore, it facilitates the evaluation of stream learning algorithms and offers change detection strategies to detect concept drift.
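
The stream constraints listed in the abstract (inspect each instance at most once, always be ready to predict, watch for concept drift) can be illustrated with a minimal test-then-train loop. This is only a sketch under stated assumptions, not the architecture proposed in the paper: the class and function names (OnlinePerceptron, WindowDriftDetector, prequential, synthetic_stream), the perceptron learner, the synthetic data, and the window-based drift rule with its 0.2 margin are all hypothetical choices made for illustration.

```python
import random
from collections import deque


class OnlinePerceptron:
    """Linear classifier updated one instance at a time, in constant memory."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score > 0 else 0

    def learn_one(self, x, y):
        err = y - self.predict(x)  # -1, 0, or +1
        if err:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err


class WindowDriftDetector:
    """Hypothetical rule: compare the recent error rate with the rate since the last reset."""

    def __init__(self, window=100, margin=0.2):
        self.window, self.margin = window, margin
        self._reset()

    def _reset(self):
        self.recent = deque(maxlen=self.window)
        self.seen = self.errors = 0

    def update(self, wrong):
        """Record one 0/1 error indicator; return True if a drift is flagged."""
        self.recent.append(wrong)
        self.seen += 1
        self.errors += wrong
        full = len(self.recent) == self.window
        recent_rate = sum(self.recent) / self.window
        if full and recent_rate > self.errors / self.seen + self.margin:
            self._reset()  # restart the statistics for the new concept
            return True
        return False


def prequential(stream, model, detector):
    """Test-then-train: predict on each instance first, then learn from it once."""
    correct = total = 0
    drifts = []
    for t, (x, y) in enumerate(stream):
        wrong = int(model.predict(x) != y)  # test on the unseen instance ...
        model.learn_one(x, y)               # ... then train; x is not stored
        correct += 1 - wrong
        total += 1
        if detector.update(wrong):
            drifts.append(t)
    return correct / max(total, 1), drifts


def synthetic_stream(n, seed=0):
    """Yield (x, y) with y = 1 when x[0] + x[1] > 0 (illustrative data only)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, int(x[0] + x[1] > 0)


if __name__ == "__main__":
    acc, drifts = prequential(synthetic_stream(2000), OnlinePerceptron(2), WindowDriftDetector())
    print(f"prequential accuracy: {acc:.3f}, drift flags at: {drifts}")
```

Because every instance is scored before it is used for training, the running accuracy doubles as an evaluation of the learner on unseen data, and the same stream of 0/1 errors feeds the change detector. A production system would more likely use a published detector and decide how the model should react once a drift is flagged.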

A.R. Hussain (corresponding author), Research & Development, Host Analytics Software Pvt. Ltd., Hyderabad 500 081, AP, India
M.A. Hameed, Department of Computer Science, University College of Engineering, Osmania University, Hyderabad, India
S. Fatima, Department of Computer Science, M.J College of Engineering and Technology, Hyderabad, India
© Springer Science+Business Media Singapore 2016. H.S. Saini et al. (eds.), Innovations in Computer Science and Engineering, Advances in Intelligent Systems and Computing 413, DOI 10.1007/978-981-10-0419-3_36






Keywords Online machine learning · Stream mining · Log analysis · Concept drift · Real-time · Robust · Throughput

1 Introduction A growing number of emerging business and scientific applications, such as satellite radar, stock markets, transaction web logs, real-time surveillance systems, telecommunication systems, sensor networks [1, 2], and other dynamic environments, generate massive amounts of data. Such a continuously generated, real-time, unbounded sequence of data is called a data stream [1–4]. In the last decade, much research attention has been given to log processing and the mining of data streams. Mining streams is in demand because it helps extract important knowledge needed to make crucial decisions in real time. However, log analysis and the extraction of information structures such as models and patterns pose many challenges in storage, computation, and querying. Due to huge memory requirements and h