A novel online ensemble approach to handle concept drifting data streams: diversified dynamic weighted majority
ORIGINAL ARTICLE
Parneeta Sidhu • M. P. S. Bhatia
Received: 15 July 2014 / Accepted: 16 January 2015 / © Springer-Verlag Berlin Heidelberg 2015
Abstract We present an online ensemble approach, diversified dynamic weighted majority (DDWM), to classify new data instances whose underlying conceptual distributions vary over time. Our approach maintains two weighted ensembles that differ in their level of diversity. An expert in either ensemble is updated or removed according to its classification accuracy, and a new expert is added based on the final global prediction of the algorithm and the global prediction of the ensemble for a given data instance. Experimental evaluation on various artificial and real-world datasets shows that DDWM classifies new data instances with very high accuracy, irrespective of the size of the dataset, the type of drift, or the presence of noise. We compare DDWM with other learners in terms of new performance metrics such as the kappa statistic, model cost, evaluation time, and memory requirements. Our approach proved highly resource-effective, achieving very high accuracy even in a resource-constrained environment.
Keywords Concept drift · Ensemble · Diversity · Data stream · Online learning
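The expert update, removal, and addition steps mentioned in the abstract can be illustrated with a minimal sketch of a single weighted-majority ensemble. This is not the DDWM algorithm itself: it follows the general dynamic-weighted-majority pattern with one ensemble only, and the names `MajorityClassExpert` and `WeightedMajorityEnsemble`, the toy base learner, and the parameters `beta` (weight penalty) and `theta` (removal threshold) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class MajorityClassExpert:
    """Toy base learner (an assumption): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training the expert abstains (None), which counts as a wrong vote.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


class WeightedMajorityEnsemble:
    """Hypothetical single ensemble with weighted voting and dynamic expert management."""

    def __init__(self, beta=0.5, theta=0.01):
        self.beta = beta      # multiplicative penalty for a wrong expert (assumed value)
        self.theta = theta    # experts whose weight falls below this are removed (assumed value)
        self.experts = [[MajorityClassExpert(), 1.0]]  # [expert, weight] pairs

    def predict(self, x):
        # Global prediction: the label with the largest total expert weight.
        votes = defaultdict(float)
        for expert, weight in self.experts:
            votes[expert.predict(x)] += weight
        return max(votes, key=votes.get)

    def update(self, x, y):
        global_prediction = self.predict(x)
        # Penalise every expert that misclassified the instance.
        for pair in self.experts:
            if pair[0].predict(x) != y:
                pair[1] *= self.beta
        # Rescale so the best expert has weight 1, then drop weak experts.
        top = max(weight for _, weight in self.experts)
        for pair in self.experts:
            pair[1] /= top
        self.experts = [pair for pair in self.experts if pair[1] >= self.theta]
        # Add a fresh expert when the ensemble as a whole was wrong.
        if global_prediction != y:
            self.experts.append([MajorityClassExpert(), 1.0])
        # Train all experts on the new labelled instance.
        for expert, _ in self.experts:
            expert.learn(x, y)
        return global_prediction
```

Feeding this sketch a stream whose label switches abruptly from one class to another shows the intended behaviour: experts trained on the old concept lose weight and are eventually removed, while experts created after the switch take over the vote.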
P. Sidhu (✉) · M. P. S. Bhatia
Division of CoE, Netaji Subhas Institute of Technology, Sec-3 Dwarka, New Delhi 110078, India
e-mail: [email protected]
M. P. S. Bhatia e-mail: [email protected]

1 Introduction
Data stream mining is a very important research area in the machine learning community. It is the process of studying the concept underlying the data, and the variations in that concept, in order to classify new data instances with higher accuracy. Data streams differ from static databases in that they may have varying underlying concepts, unlimited size, high speed, and high dimensionality [52]. A data instance in a stream can be accessed only once, when it arrives; after that, it is replaced by a new instance, which may have a different conceptual distribution.

The 'concept' for a data instance refers to the underlying data distribution, described by the joint distribution p(x, y) [1], where x is the n-dimensional feature vector and y is its class label. The term 'concept drift' refers to a change in the underlying conceptual distribution [6, 7, 15] as new instances arrive, for example in applications such as market-basket analysis [10], computer security, internet data, credit fraud detection, and bioinformatics. In market-basket analysis, a similar concept is seen in customer buying behavior each year during the Christmas season. This pattern recurs every year (i.e., recurrent drift), resulting in a drift from the customer's buying pattern of the previous month.

A drift present in a dataset is measured by its severity and speed. Severity represents the amount of change caused by a new concept. Speed is the inverse of the time taken for a new concept