Parallel and Distributed Data Mining in Cloud
Abstract. The paper describes an approach to the distributed execution of data mining algorithms and its use in building a Cloud for Data Mining. The suggested approach allows data mining algorithms to be executed in different parallel and distributed environments. Thus, the created Cloud for Data Mining can be used both as an analytic service and as a platform for researching and debugging parallel and distributed data mining algorithms.
Keywords: Data mining · Big data · Parallel algorithms · Cloud for Data Mining · Cloud computing
1 Introduction
Data volumes are currently growing rapidly. Data are collected in information systems, generated by different devices, saved as logs of computer system activity, etc. Modern data warehouses store large amounts of heterogeneous data. The terms “Big data”, “Internet of Things”, and “Cloud computing” have recently become very popular. They refer to technologies for collecting, storing, and handling large volumes of data of various types generated at high velocity (Big data). However, all this is worthless if we do not analyze the data and obtain new knowledge from it. Technologies such as Machine Learning, Data Mining, and Knowledge Discovery are used to discover new knowledge in data. They rely on complex mathematical methods and algorithms that need powerful computing resources to analyze Big data. Cloud and cluster technologies provide scalable, effectively unlimited resources. Integrating data mining with cloud computing technologies is therefore very important. The result of this integration is a Cloud for Data Mining (CDM). This solution has a number of advantages:
– users always have the latest version of the algorithm;
– algorithms can use all the computational resources available in the “cloud”;
– algorithms can be applied to data stored in the “cloud” and outside of it;
– users can forget about scaling algorithms.
© Springer International Publishing Switzerland 2016 P. Perner (Ed.): ICDM 2016, LNAI 9728, pp. 349–362, 2016. DOI: 10.1007/978-3-319-41561-1_26
I. Kholod et al.
For an analyst to work comfortably, the CDM should have the following features:
– a usable multiuser interface;
– execution of the full analysis cycle;
– access to internal and external data sources;
– support for different data sources;
– use of all available computing resources;
– a wide range of data mining algorithms, among others.
Additionally, the CDM must provide the following capabilities for data mining researchers and developers:
– a unified API;
– the ability to add new data mining algorithms;
– a wide range of parallel and distributed environments.
The paper describes an approach and an architecture for a CDM that has these capabilities. The paper is organized as follows. The next section reviews similar CDM systems. Section 3 describes a general approach that allows an algorithm decomposed into blocks to be mapped onto different distributed systems. Section 4 describes the CDM architecture. The last section discusses ex
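To make the researcher-facing capabilities listed above more concrete, here is a minimal Python sketch of what a unified API with pluggable execution environments might look like. It is an illustration under stated assumptions, not the paper's design: the names `Environment`, `SequentialEnvironment`, `ThreadPoolEnvironment`, `MiningAlgorithm`, and `ItemFrequency` are hypothetical, and the split of an algorithm into one "local" block and one "merge" block is only one simple way an algorithm could be decomposed into blocks.

```python
from abc import ABC, abstractmethod
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Environment(ABC):
    """Hypothetical execution environment: decides how blocks are run."""

    @abstractmethod
    def map_blocks(self, block, partitions):
        """Apply `block` to every data partition and return the results in order."""


class SequentialEnvironment(Environment):
    def map_blocks(self, block, partitions):
        return [block(partition) for partition in partitions]


class ThreadPoolEnvironment(Environment):
    def __init__(self, workers=4):
        self.workers = workers

    def map_blocks(self, block, partitions):
        # Each partition is handled by a worker thread; results keep input order.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(block, partitions))


class MiningAlgorithm(ABC):
    """Hypothetical unified API: an algorithm is a local block plus a merge block."""

    @abstractmethod
    def local_block(self, partition):
        """Compute a partial result on one data partition."""

    @abstractmethod
    def merge_block(self, partials):
        """Combine the partial results into the final model."""

    def run(self, env, partitions):
        # The algorithm does not know how its blocks are executed.
        return self.merge_block(env.map_blocks(self.local_block, partitions))


class ItemFrequency(MiningAlgorithm):
    """Toy algorithm: counts in how many transactions each item occurs."""

    def local_block(self, partition):
        counts = Counter()
        for transaction in partition:
            counts.update(set(transaction))
        return counts

    def merge_block(self, partials):
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total


# Two partitions of a small, made-up transaction data set.
partitions = [
    [["milk", "bread"], ["milk"]],
    [["bread", "eggs"], ["milk", "eggs"]],
]

sequential_result = ItemFrequency().run(SequentialEnvironment(), partitions)
parallel_result = ItemFrequency().run(ThreadPoolEnvironment(workers=2), partitions)
```

In this sketch, adding a new algorithm means subclassing `MiningAlgorithm`, and adding a new environment means subclassing `Environment`; neither change touches the other side, which is the kind of separation the capability list asks for.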