A two ensemble system to handle concept drifting data streams: recurring dynamic weighted majority

ORIGINAL ARTICLE

Parneeta Sidhu¹ · M. P. S. Bhatia¹

Received: 4 August 2016 / Accepted: 26 October 2017
© Springer-Verlag GmbH Germany 2017

Abstract  We present an ensemble system, recurring dynamic weighted majority (RDWM), that maintains two ensembles of experts so as to accurately handle drifting concepts, mainly recurrent drifts. The primary online ensemble represents the present concepts, and the secondary ensemble represents the old concepts seen since the beginning of learning. An effective pruning methodology helps to remove redundant and old classifiers, which might otherwise interfere with learning the new concepts. Experimental evaluation using datasets shows that RDWM achieves very high generalization accuracy, irrespective of the speed or severity of drift or the presence of noise in the dataset.

Keywords  Concept drift · Ensemble · Recurrent · Data stream

¹ Division of CoE, Netaji Subhas Institute of Technology, Sec-3, Dwarka, New Delhi 110078, India

1 Introduction

Mining large streams of data is an emerging area of research in the machine learning community. Data stream mining is the process of understanding the underlying concepts in data and analyzing drifts [3, 6, 32], so as to accurately classify new instances. A drift can be sudden, gradual, recurring, or incremental. A sudden change is observed when the concept changes from one class to another within a single time step. A gradual change occurs when the new concept emerges gradually over time. A change is said to be recurrent if an old concept reappears after some time. A drift is incremental if any two consecutive concepts are nearly identical, so that the drift becomes noticeable only over a longer time period. Further, a drift can also be measured by its severity and speed. Severity represents the amount of change caused by a new concept, and speed is the inverse of the total time taken for a new concept to completely replace the old concept. Applications in which drifts have been observed include market-basket analysis [12], computer security, and medical diagnosis.

Online approaches [1, 4, 6, 12, 16, 18, 26, 37] process each instance only “once”, on arrival, without storing it for further processing. They can be categorized as approaches that explicitly use a mechanism to handle drifts [1, 6, 18] and approaches that do not explicitly use a mechanism for drift detection [4, 12]. An online approach may be a single classifier, a single ensemble, or an active classifier combined with a set of weighted classifier systems. None of the existing systems maintains more than one ensemble in its model.

Studies have shown that an ensemble of classifiers [5, 7, 11, 35, 38] provides higher generalization accuracy [3, 30, 36] than a single classifier system. Hence, we propose the Recurring Dynamic Weighted Majority (RDWM) system, which maintains two ensembles: a primary online ensemble